In this notebook, we'll be working with the Chinook database. This database contains information about a fictional digital music store.
By querying this database using sqlite, we'll attempt to obtain data driven answers to the following business questions:
After querying the database to generate tables answering the above questions, we concluded that:
To begin, we'll setup sqlite to run properly in this notebook.
%%capture
%load_ext sql
%sql sqlite:///data/chinook.db
Now, we write a short query that gives us the name and type of any table or view in our database. Notice that before running this query, we drop two views (which are created later on in this notebook). This will solve any issues that may arise from running certain code-cells multiplie times.
%%sql
DROP VIEW IF EXISTS order_by_country;
DROP VIEW IF EXISTS stats_by_country;
SELECT
name,
type
FROM sqlite_master
WHERE type IN ("table","view");
* sqlite:///data/chinook.db Done. Done. Done.
name | type |
---|---|
album | table |
artist | table |
customer | table |
employee | table |
genre | table |
invoice | table |
invoice_line | table |
media_type | table |
playlist | table |
playlist_track | table |
track | table |
We see the name of each tabe. We can check our Schema diagram above that all tables are accounted for. Now, we can begin writing some queries to answer our questions.
In the following query, we generate a table which answers the following questions:
%%sql
/* Use a with clause to generate a temprorary table
contains genre, dollar amount of tracks_sold */
WITH tracks_by_genre AS
(
SELECT
g.name AS genre,
CAST(SUM(il.quantity) AS float) AS tracks_sold /* cast as float for future percentage calculation */
FROM invoice_line AS il
/* join genre to invoice_line through track table (see Schema) */
INNER JOIN
track AS t
ON t.track_id = il.track_id
INNER JOIN
genre AS g
ON g.genre_id = t.genre_id
GROUP BY genre
)
/* Main query */
SELECT
genre,
CAST(tracks_sold AS int) as tracks_sold,
ROUND(100*tracks_sold/(SELECT SUM(tracks_sold)
FROM tracks_by_genre)
,2) AS percentage
FROM tracks_by_genre
ORDER BY tracks_sold DESC
* sqlite:///data/chinook.db Done.
genre | tracks_sold | percentage |
---|---|---|
Rock | 2635 | 55.39 |
Metal | 619 | 13.01 |
Alternative & Punk | 492 | 10.34 |
Latin | 167 | 3.51 |
R&B/Soul | 159 | 3.34 |
Blues | 124 | 2.61 |
Jazz | 121 | 2.54 |
Alternative | 117 | 2.46 |
Easy Listening | 74 | 1.56 |
Pop | 63 | 1.32 |
Electronica/Dance | 55 | 1.16 |
Classical | 47 | 0.99 |
Reggae | 35 | 0.74 |
Hip Hop/Rap | 33 | 0.69 |
Heavy Metal | 8 | 0.17 |
Soundtrack | 5 | 0.11 |
TV Shows | 2 | 0.04 |
Drama | 1 | 0.02 |
Here, we see that Rock is by far the most popular genre, accounting for over 50% of the total number of tracks sold. Since a high percentage of these purchases likely correspond to older tracks, we reccomend searching for upcoming artists in the second & third most popular genres of Metal and Alternative/Punk.
Next, we'll take a look at how emplyee perforamnce can be analyzed by looking at the employee, customer, and invoice tables.
Using the Schema, we can relate each customer to an employee who represents him/her. Hence, after relating each invoice to the customer who made the purchase, we can sum the total number of revenue associated with each sales representative. We do this with the following query.
%%sql
SELECT
e.first_name || " " || e.last_name AS employee_name,
SUM(i.total) AS total_sales, /* total from invoices (to be grouped by employee name) */
e.hire_date
FROM
invoice AS i
LEFT JOIN
customer AS c
ON c.customer_id = i.customer_id
LEFT JOIN
employee AS e
ON e.employee_id = c.support_rep_id /* left join each customer to their employee representative */
GROUP BY employee_name;
* sqlite:///data/chinook.db Done.
employee_name | total_sales | hire_date |
---|---|---|
Jane Peacock | 1731.5100000000039 | 2017-04-01 00:00:00 |
Margaret Park | 1584.0000000000034 | 2017-05-03 00:00:00 |
Steve Johnson | 1393.920000000002 | 2017-10-17 00:00:00 |
From here we see that total sales is similar among each employee. Jane Peacock has the most, but she was also hired the earliest. After correcting for time spent at the company, it seems each employee is performing similarly.
To investigate if any countries should be specifically marketed towards, we write a query which tells us number of customers, sales, as well as other factors, for each country.
To begin, we'll create two views:
%%sql
DROP VIEW IF EXISTS order_by_country;
CREATE VIEW order_by_country AS
SELECT
i.total,
c.customer_id,
c.first_name || " " || c.last_name AS customer_name,
c.country
FROM
invoice AS i
LEFT JOIN
customer AS c
ON c.customer_id = i.customer_id;
DROP VIEW IF EXISTS stats_by_country;
CREATE VIEW stats_by_country AS
SELECT
country,
COUNT(DISTINCT customer_id) AS number_of_customers,
ROUND(SUM(total),2) AS total_sales,
COUNT(customer_id) AS number_of_orders,
ROUND(SUM(total)/(SELECT CAST(
COUNT(DISTINCT customer_id) AS float)
FROM order_by_country
GROUP BY country),2)
AS sales_per_customer
FROM order_by_country
GROUP BY country;
* sqlite:///data/chinook.db Done. Done. Done. Done.
[]
With these views created, we can retrieve the data we need. A quick look at the above views reveal that many countries only have a single customer. We want to group all countries with only 1 customer into their own category, "other."
In the following query, we retrieve the information we need, while grouping countries with only a single customer into their own row.
%%sql
SELECT
country,
number_of_customers,
total_sales,
sales_per_customer,
sales_per_order
FROM
(
SELECT
country,
number_of_customers,
total_sales,
total_sales/number_of_customers AS sales_per_customer,
ROUND(total_sales/number_of_orders,2) AS sales_per_order,
"A" AS sorting
FROM stats_by_country
WHERE number_of_customers > 1
UNION
SELECT
"Other",
SUM(number_of_customers),
SUM(total_sales),
SUM(total_sales)/SUM(number_of_customers),
SUM(ROUND(total_sales/number_of_orders,2)) AS sales_per_order,
"B" AS sorting
FROM stats_by_country
WHERE number_of_customers = 1
)
ORDER BY sorting, total_sales DESC
* sqlite:///data/chinook.db Done.
country | number_of_customers | total_sales | sales_per_customer | sales_per_order |
---|---|---|---|---|
USA | 13 | 1040.49 | 80.03769230769231 | 7.94 |
Canada | 8 | 535.59 | 66.94875 | 7.05 |
Brazil | 5 | 427.68 | 85.536 | 7.01 |
France | 5 | 389.07 | 77.814 | 7.78 |
Germany | 4 | 334.62 | 83.655 | 8.16 |
Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
Portugal | 2 | 185.13 | 92.565 | 6.38 |
India | 2 | 183.15 | 91.575 | 8.72 |
Other | 15 | 1094.94 | 72.99600000000001 | 111.66 |
We see that a plurality of sales come from the "Other" category. This suggests that marketing towards any specific country may not be beneficial. However, Czech Republic, Portugal, and India have high values of sales_per_customer.
Since the sample size for each country is so small, we recommend to not allocate too many resources towards targetting these countries specifically. Running smaller campaigns for these countries may allow a slight increase in sample size, which would give us a better idea as to whether or not larger campaigns would be useful.
To determine whether the company should add an album purchase feature, we would like to know the percentage of invoices which correspond to an album purchase. To do this, we use the following method:
This idea is implemented in the following query.
%%sql
WITH
invoice_album AS
(
SELECT il.invoice_id,
il.invoice_line_id,
il.track_id,
t.album_id
FROM
invoice_line il
LEFT JOIN
track t
ON t.track_id = il.track_id
),
album_lengths AS
(
SELECT album_id,
COUNT(DISTINCT track_id) AS album_length
FROM track
GROUP BY album_id
)
SELECT album_purchase,
COUNT(invoice_id) AS number_of_invoices,
ROUND(
CAST(
COUNT(
invoice_id) AS float)/(SELECT
COUNT(invoice_id) FROM invoice),3) AS percent
FROM
(
SELECT ia.invoice_id,
COUNT(DISTINCT ia.album_id) AS num_albums,
COUNT(DISTINCT ia.track_id) AS distinct_songs,
MAX(ia.album_id) AS album_id,
al.album_length,
"yes" AS album_purchase
FROM invoice_album ia
LEFT JOIN album_lengths al
ON al.album_id = ia.album_id
GROUP BY invoice_id
HAVING (num_albums = 1) AND (distinct_songs = album_length)
UNION
SELECT ia.invoice_id,
COUNT(DISTINCT ia.album_id) AS num_albums,
COUNT(DISTINCT ia.track_id) AS distinct_songs,
MAX(ia.album_id) AS album_id,
al.album_length,
"no" AS album_purchase
FROM invoice_album ia
LEFT JOIN album_lengths al
ON al.album_id = ia.album_id
GROUP BY invoice_id
HAVING (num_albums != 1) OR (distinct_songs != album_length)
)
GROUP BY album_purchase
* sqlite:///data/chinook.db Done.
album_purchase | number_of_invoices | percent |
---|---|---|
no | 500 | 0.814 |
yes | 114 | 0.186 |
Based on this data, it seems that adding an album purchase option would be a worthwhile investment since it would make the purchasing process for 1/5 of all purchases much simpler.