Chunk Processing Loan Data with Pandas & SQL¶

Sometimes, data engineers have to handle datasets that are too large to fit into memory at once. In that case, the workflow for cleaning and analyzing the data becomes slightly more complicated. In this notebook, we'll demonstrate how this can be done using chunk processing in pandas. Throughout the notebook, we'll assume an artificial memory limit of 10MB to demonstrate working under memory constraints.
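
As a minimal sketch of the pattern we'll follow (the 1,000-row chunksize here is arbitrary; we tune it properly later in the notebook), passing chunksize to pd.read_csv() yields an iterator of ordinary DataFrames, and results are accumulated across chunks:

In [ ]:
import pandas as pd

# Sketch of the chunked-processing pattern: only `chunksize` rows are
# held in memory at any one time; results accumulate across chunks.
row_count = 0
for chunk in pd.read_csv('data/loans_2007.csv', chunksize=1000):
    row_count += len(chunk)

print(f"total rows: {row_count}")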

Summary¶

After reading in the data in chunks of 3,200 rows (each of which consumes about 5MB, or roughly 50% of our 10MB memory limit), we were able to:

  • Drop irrelevant/redundant columns
  • Downcast columns to space-efficient subtypes
  • Convert string columns with only a few unique values to categorical columns

After completing the above steps, we decreased the total memory footprint of the data from about 65MB to only about 15MB! This optimization allows the data analysis process to go much more smoothly, since calculations done with only a few selected columns will likely fit within the artificial memory constraint.
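
For example, an aggregate can stay well under the limit by loading only the columns it needs. Here is a hedged sketch (the usecols selection and 10,000-row chunksize are illustrative) computing the mean loan amount across the whole file:

In [ ]:
import pandas as pd

# Sketch: load a single column in chunks and accumulate the pieces of
# the mean, so memory use stays tiny regardless of file size.
total, count = 0.0, 0
for chunk in pd.read_csv('data/loans_2007.csv', usecols=['loan_amnt'], chunksize=10000):
    total += chunk['loan_amnt'].sum()       # sum() skips missing values
    count += chunk['loan_amnt'].count()     # count() skips missing values

print(f"mean loan amount: {total / count:.2f}")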

About the Data¶

The dataset used throughout this notebook can be found here. This dataset comes from Lending Club, which is a service that connects those seeking personal loans with potential investors. Since the focus of this notebook is demonstrating a workflow using pandas that scales well with large amounts of data, we omit a data dictionary for this project.

Initial Memory Footprint and Datatypes¶

To begin, we'll read in the first five rows just to get a peek at the dataset.

In [ ]:
import pandas as pd
pd.options.display.max_columns = 99     # Lets us see all columns

preview = pd.read_csv('data/loans_2007.csv', nrows = 5)
preview
Out[ ]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med policy_code application_type acc_now_delinq chargeoff_within_12_mths delinq_amnt pub_rec_bankruptcies tax_liens
0 1077501 1296599.0 5000.0 5000.0 4975.0 36 months 10.65% 162.87 B B2 NaN 10+ years RENT 24000.0 Verified Dec-2011 Fully Paid n credit_card Computer 860xx AZ 27.65 0.0 Jan-1985 1.0 3.0 0.0 13648.0 83.7% 9.0 f 0.00 0.00 5863.155187 5833.84 5000.00 863.16 0.00 0.00 0.00 Jan-2015 171.62 Jun-2016 0.0 1.0 INDIVIDUAL 0.0 0.0 0.0 0.0 0.0
1 1077430 1314167.0 2500.0 2500.0 2500.0 60 months 15.27% 59.83 C C4 Ryder < 1 year RENT 30000.0 Source Verified Dec-2011 Charged Off n car bike 309xx GA 1.00 0.0 Apr-1999 5.0 3.0 0.0 1687.0 9.4% 4.0 f 0.00 0.00 1008.710000 1008.71 456.46 435.17 0.00 117.08 1.11 Apr-2013 119.66 Sep-2013 0.0 1.0 INDIVIDUAL 0.0 0.0 0.0 0.0 0.0
2 1077175 1313524.0 2400.0 2400.0 2400.0 36 months 15.96% 84.33 C C5 NaN 10+ years RENT 12252.0 Not Verified Dec-2011 Fully Paid n small_business real estate business 606xx IL 8.72 0.0 Nov-2001 2.0 2.0 0.0 2956.0 98.5% 10.0 f 0.00 0.00 3005.666844 3005.67 2400.00 605.67 0.00 0.00 0.00 Jun-2014 649.91 Jun-2016 0.0 1.0 INDIVIDUAL 0.0 0.0 0.0 0.0 0.0
3 1076863 1277178.0 10000.0 10000.0 10000.0 36 months 13.49% 339.31 C C1 AIR RESOURCES BOARD 10+ years RENT 49200.0 Source Verified Dec-2011 Fully Paid n other personel 917xx CA 20.00 0.0 Feb-1996 1.0 10.0 0.0 5598.0 21% 37.0 f 0.00 0.00 12231.890000 12231.89 10000.00 2214.92 16.97 0.00 0.00 Jan-2015 357.48 Apr-2016 0.0 1.0 INDIVIDUAL 0.0 0.0 0.0 0.0 0.0
4 1075358 1311748.0 3000.0 3000.0 3000.0 60 months 12.69% 67.79 B B5 University Medical Group 1 year RENT 80000.0 Source Verified Dec-2011 Current n other Personal 972xx OR 17.94 0.0 Jan-1996 0.0 15.0 0.0 27783.0 53.9% 38.0 f 461.73 461.73 3581.120000 3581.12 2538.27 1042.85 0.00 0.00 0.00 Jun-2016 67.79 Jun-2016 0.0 1.0 INDIVIDUAL 0.0 0.0 0.0 0.0 0.0

We have a few observations right off the bat:

  • The issue_d, earliest_cr_line, last_pymnt_d, and last_credit_pull_d columns store dates, but are likely read in as strings.
  • The term and int_rate columns are likely stored as strings, but can be converted to numeric.

Let's use chunk processing to compute the memory footprint of the entire dataset. We'll take 1 megabyte (MB) to be $1024^2$ bytes.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)

total = 0
for i,chunk in enumerate(chunk_iter):
    chunk_mem = chunk.memory_usage(deep=True).sum()/(1024*1024)     # Convert memory to MBs
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem,3)} MB")

print(f"total memory: {round(total,3)} MB")
chunk 0: 4.886 MB
chunk 1: 4.881 MB
chunk 2: 4.884 MB
chunk 3: 4.883 MB
chunk 4: 4.881 MB
chunk 5: 4.881 MB
chunk 6: 4.882 MB
chunk 7: 4.883 MB
chunk 8: 4.881 MB
chunk 9: 4.891 MB
chunk 10: 4.891 MB
chunk 11: 4.903 MB
chunk 12: 5.128 MB
chunk 13: 1.518 MB
total memory: 65.274 MB

We see that using a 3,200-row chunksize keeps the maximum memory footprint among all the chunks at around 50% of our 10MB memory limit. We'll use this chunksize for the remainder of the notebook. Let's check which datatypes are present throughout the chunks.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)

dtypes = [chunk.dtypes.value_counts().index.tolist() for chunk in chunk_iter]
dtypes
Out[ ]:
[[dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O'), dtype('int64')],
 [dtype('float64'), dtype('O')],
 [dtype('float64'), dtype('O')]]

We see that the datatypes in the last two chunks differ from those in the other chunks. Let's take a closer look by counting how many columns of each type there are per chunk.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
strings = []
floats = []
ints = []

for chunk in chunk_iter:
    strings.append(chunk.select_dtypes(include='O').shape[1])       # .shape[1] selects number of columns of DataFrame 
    floats.append(chunk.select_dtypes(include='float64').shape[1])  # of each dtype
    ints.append(chunk.select_dtypes(include='int64').shape[1])
print(f"Object cols: {strings}")
print(f"Float cols: {floats}")
print(f"Int cols: {ints}")
Object cols: [21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]
Float cols: [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]
Int cols: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

We see that in the last two chunks, one of the integer columns is being interpreted as an object column. Let's see which column this is.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)


for i,chunk in enumerate(chunk_iter):
    if i == 1:
        initial_obj_set = set(chunk.select_dtypes(include='O').columns.tolist())    # set of object columns for most chunks
    
    elif i == 13:
        final_obj_set = set(chunk.select_dtypes(include='O').columns.tolist())      # set of object columns for final chunk

col = final_obj_set.difference(initial_obj_set)     # get extra object column
col
Out[ ]:
{'id'}

Interestingly, the id column is being read in as an object in the final two chunks. Looking at the tail of the last chunk reveals what is occurring.

In [ ]:
chunk.tail(3)
Out[ ]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med policy_code application_type acc_now_delinq chargeoff_within_12_mths delinq_amnt pub_rec_bankruptcies tax_liens
42535 70686 70681.0 5000.0 5000.0 0.0 36 months 7.75% 156.11 A A3 Homemaker 10+ years MORTGAGE 70000.0 Not Verified Jun-2007 Does not meet the credit policy. Status:Fully ... n other Aroundthehouse 068xx CT 8.81 NaN NaN NaN NaN NaN 0.0 NaN NaN f 0.0 0.0 5619.76209 0.0 5000.0 619.76 0.0 0.0 0.0 Jun-2010 156.39 Feb-2015 NaN 1.0 INDIVIDUAL NaN NaN NaN NaN NaN
42536 Total amount funded in policy code 1: 471701350 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
42537 Total amount funded in policy code 2: 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

We see that the last two rows are erroneous and do not fit the file format. We can avoid the dtype problem by dropping the last two rows of the final chunk. In some cases, we may want to make sure this information is still documented, though that step is not strictly necessary for this notebook.
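
If we did want to document the removed rows, one simple approach (sketched here; the quarantine file name is just an illustrative choice) is to write them out before dropping them:

In [ ]:
# Sketch: archive the malformed trailing rows before discarding them,
# so the "Total amount funded ..." summary lines aren't silently lost.
bad_rows = chunk.tail(2)
bad_rows.to_csv('data/quarantined_rows.csv')    # illustrative file name
chunk = chunk.drop(bad_rows.index)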

Looking at the number of int columns for the second-to-last chunk, it seems that a similar phenomenon is occurring there. Let's take a closer look.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)

for i,chunk in enumerate(chunk_iter):
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True)       # drop erroneous rows on final chunk
        chunk["id"] = chunk["id"].astype('int64')           # fix dtype 
        obj_set = set(chunk.select_dtypes(include='O'))

    elif i == 12:
        second_to_last = chunk
        obj_set_wrong = set(chunk.select_dtypes(include='O'))

col = obj_set_wrong.difference(obj_set)
col
Out[ ]:
{'id'}

We suspect from the previous analysis that a single row, whose id is a non-numeric string, is erroneous. Let's find that row by using a regex to select rows whose id contains no digits.

In [ ]:
second_to_last.loc[~second_to_last["id"].str.contains('[0-9]+'),:]  # select rows whose id contains no digits
Out[ ]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length home_ownership annual_inc verification_status issue_d loan_status pymnt_plan purpose title zip_code addr_state dti delinq_2yrs earliest_cr_line inq_last_6mths open_acc pub_rec revol_bal revol_util total_acc initial_list_status out_prncp out_prncp_inv total_pymnt total_pymnt_inv total_rec_prncp total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_d last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med policy_code application_type acc_now_delinq chargeoff_within_12_mths delinq_amnt pub_rec_bankruptcies tax_liens
39786 Loans that do not meet the credit policy NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

This marker row indicates that the rows below it do not meet the credit policy. We could create a separate column keeping track of this (a sketch follows below); however, it is not needed for our purposes. In fact, since the id column does not contain information relevant to our analysis, we can resolve the dtype problems in the last two chunks simultaneously by simply dropping the id column from each chunk (while still dropping the erroneous rows themselves). We do this throughout the remainder of the notebook.
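
For illustration, here is a sketch of that optional flag (meets_credit_policy is a hypothetical column name): every row above the marker meets the policy, every row below it does not, and the final chunk would get the flag set to False throughout.

In [ ]:
# Sketch: record the credit-policy split as a boolean column, then
# drop the marker row itself. The final chunk would get this flag set
# to False for every row.
marker = 39786    # index of the "Loans that do not meet the credit policy" row
second_to_last['meets_credit_policy'] = second_to_last.index < marker   # hypothetical column
second_to_last = second_to_last.drop(labels=marker)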

Now, let's determine whether many columns contain mostly repeated values. We'll compute the percentage of unique values in each string column, and see which columns contain less than 40% unique values. These columns will be candidates for conversion to a categorical type.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)

total_repeated = []     # For each chunk, a list of columns containing few unique values 
total_num = []          # For each chunk, the number of unique values in each string column
for i,chunk in enumerate(chunk_iter):

    chunk.drop(columns=['id'], inplace=True)        # drop id column
    
    # drop erroneous rows on last and 2nd to last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True) 
    elif i == 12:
        droprow = 39786
        chunk.drop(labels= droprow, axis = 0, inplace=True) 
        
    
    chunk_repeated = []     # list of columns with few unique values
    chunk_num = []          # list of the number of unique values for each string column

    # list of columns with string dtype
    str_cols = chunk.select_dtypes(include='O').columns.tolist()

    for col in str_cols:
        num_unique = chunk[col].value_counts().shape[0]     # get number of unique values
        chunk_num.append(num_unique)

        percent_unique = (num_unique/chunk.shape[0])*100    # convert into a percentage
        if percent_unique < 40:
            chunk_repeated.append(col)              # append column if % of unique values is small

    total_repeated.append(chunk_repeated)
    total_num.append(chunk_num)
In [ ]:
for item in total_repeated:
    print(item)
    print(len(item))
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
19
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'addr_state', 'earliest_cr_line', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
17

Since the final chunk has far fewer rows, a column's unique values make up a larger percentage of it, so fewer of its columns fall under the 40% threshold. Overall, however, all 19 columns flagged in the other chunks are good candidates for categorical columns.

We'll convert to the category datatype the columns that appear as candidates in every chunk except the last.

In [ ]:
category_candidates = set(total_repeated[0])
for item in total_repeated[0:-1]:
    category_candidates = category_candidates & set(item)       # keep only columns flagged in every chunk (except the last)

category_candidates
Out[ ]:
{'addr_state',
 'application_type',
 'earliest_cr_line',
 'emp_length',
 'grade',
 'home_ownership',
 'initial_list_status',
 'int_rate',
 'issue_d',
 'last_credit_pull_d',
 'last_pymnt_d',
 'loan_status',
 'purpose',
 'pymnt_plan',
 'revol_util',
 'sub_grade',
 'term',
 'verification_status',
 'zip_code'}

Let's look at the number of unique values in each of the string columns, per chunk.

In [ ]:
df = pd.DataFrame(columns=str_cols, data=total_num)     # str_cols persists from the last loop iteration
df
Out[ ]:
term int_rate grade sub_grade emp_title emp_length home_ownership verification_status issue_d loan_status pymnt_plan purpose title zip_code addr_state earliest_cr_line revol_util initial_list_status last_pymnt_d last_credit_pull_d application_type
0 2 36 7 35 2820 11 3 3 2 6 1 13 1487 581 43 373 902 1 55 55 1
1 2 37 7 35 2754 11 3 3 4 6 1 13 1532 576 45 404 960 1 56 57 1
2 2 67 7 34 2759 11 3 3 3 5 1 13 1542 581 43 404 966 1 58 59 1
3 2 37 7 35 2747 11 3 3 3 6 1 13 1651 580 43 386 949 1 59 60 1
4 2 86 7 35 2773 11 4 3 4 6 1 13 1700 578 43 393 972 1 61 61 1
5 2 36 7 35 2781 11 3 3 4 3 1 13 2019 579 43 382 957 1 63 63 1
6 2 69 7 35 2753 11 3 3 4 3 1 13 2079 583 43 390 954 1 65 66 1
7 2 70 7 35 2786 11 3 3 4 3 1 14 2059 565 43 375 949 1 64 69 1
8 2 36 7 35 2833 11 3 3 5 3 2 14 2110 551 41 371 965 1 67 72 1
9 2 50 7 34 2833 11 3 3 6 2 1 14 2501 534 42 369 961 1 64 75 1
10 1 65 7 35 2696 11 4 3 7 2 1 14 2554 542 42 382 963 1 52 81 1
11 1 86 7 35 2805 11 4 3 13 2 1 14 2406 577 43 371 947 1 53 91 1
12 2 264 7 35 2757 11 5 3 43 4 1 14 2431 615 50 387 966 1 96 107 1
13 1 97 7 35 833 11 5 2 13 2 1 13 793 377 48 275 580 1 48 97 1

Keep in mind that the following columns are string types which will NOT be converted into a categorical type:

In [ ]:
set(df.columns).difference(category_candidates)     
Out[ ]:
{'emp_title', 'title'}

These columns, judging by their names, contain free-form descriptive text, and so should not be stored categorically.

Optimization¶

We'll now optimize the space this dataframe takes up by choosing appropriate data types. To begin, let's look at the string columns.

String columns¶

A number of string columns are actually storing datetime info. Here is a list containing the datetime columns.

In [ ]:
dt_cols = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
In [ ]:
preview[dt_cols]
Out[ ]:
issue_d earliest_cr_line last_pymnt_d last_credit_pull_d
0 Dec-2011 Jan-1985 Jan-2015 Jun-2016
1 Dec-2011 Apr-1999 Apr-2013 Sep-2013
2 Dec-2011 Nov-2001 Jun-2014 Jun-2016
3 Dec-2011 Feb-1996 Jan-2015 Apr-2016
4 Dec-2011 Jan-1996 Jun-2016 Jun-2016

We see that each of these columns stores dates in the format "%b-%Y" (abbreviated month name, then four-digit year, in datetime format codes). We can use the parse_dates and date_format arguments of pd.read_csv() to have pandas parse these columns automatically.
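
As a quick sanity check before relying on that, we can confirm the format string parses the preview cleanly:

In [ ]:
# "%b-%Y" = abbreviated month name, dash, four-digit year (e.g. Dec-2011).
pd.to_datetime(preview["issue_d"], format="%b-%Y")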

A number of string columns can also be converted to a numeric type. We'll list these columns and remove them, along with the datetime columns, from the set of categorical candidates.

In [ ]:
str_to_numeric_cols = ["term","int_rate","revol_util"]

category_cols = category_candidates.difference(dt_cols)     # remove datetime columns from category columns
category_cols = category_cols.difference(str_to_numeric_cols)       # remove columns we will convert to numeric
category_cols = list(category_cols)

Now we'll create a dictionary which tells pandas which columns we want to be read as categorical.

In [ ]:
category_cols_dict = {}
for col in category_cols:
    category_cols_dict[col] = 'category'        # generate dictionary to use in pd.read_csv(... dtype = )

Let's check the memory consumption after making these changes.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200, 
                         dtype = category_cols_dict, 
                         parse_dates=dt_cols, date_format="%b-%Y")

total = 0
for i,chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)
    # drop erroneous rows on last and 2nd to last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True) 
    elif i == 12:
        droprow = 39786
        chunk.drop(labels= droprow, axis = 0, inplace=True) 

    # remove "months" from term and convert to integer    
    term_clean = chunk["term"].str.strip().str.replace("months","").astype(int)
    chunk["term"] = term_clean

    # remove "%" and convert to float
    int_rate_clean = chunk["int_rate"].str.strip().str.replace("%","").astype(float)
    chunk["int_rate"] = int_rate_clean

    revol_util_clean = chunk["revol_util"].str.strip().str.replace("%","").astype(float)
    chunk["revol_util"] = revol_util_clean

    # compute memory usage
    chunk_mem = chunk.memory_usage(deep=True).sum()/(1024*1024) 
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem,3)} MB")

print(f"total memory: {round(total,3)} MB")
chunk 0: 1.451 MB
chunk 1: 1.448 MB
chunk 2: 1.45 MB
chunk 3: 1.449 MB
chunk 4: 1.447 MB
chunk 5: 1.446 MB
chunk 6: 1.448 MB
chunk 7: 1.447 MB
chunk 8: 1.446 MB
chunk 9: 1.455 MB
chunk 10: 1.455 MB
chunk 11: 1.47 MB
chunk 12: 1.486 MB
chunk 13: 0.452 MB
total memory: 19.349 MB

That's already a massive decrease in the memory consumption of the dataframe. The final step is to downcast the numeric columns to the most space-efficient subtypes.

Numeric Columns¶

We'll complete our workflow by making all of the previous changes, in addition to any necessary numeric conversions. The conversions for the numeric columns are as follows:

  • First, check whether any float columns can be converted to integers. Since NumPy's int datatype is not nullable, we'll only do this for columns without any missing values.
  • In either case, we'll downcast the column to the most space-efficient subtype.

This is done in the following code cell.

In [ ]:
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200, dtype = category_cols_dict, parse_dates=dt_cols, date_format="%b-%Y")
total = 0
for i,chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)
    # drop erroneous rows on last and 2nd to last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True) 
    elif i == 12:
        droprow = 39786
        chunk.drop(labels= droprow, axis = 0, inplace=True) 

    # clean columns and convert to numeric     
    term_clean = chunk["term"].str.strip().str.replace("months","").astype(int)
    chunk["term"] = term_clean
    int_rate_clean = chunk["int_rate"].str.strip().str.replace("%","").astype(float)
    chunk["int_rate"] = int_rate_clean
    revol_util_clean = chunk["revol_util"].str.strip().str.replace("%","").astype(float)
    chunk["revol_util"] = revol_util_clean

    floats = chunk.select_dtypes(include='float')
    float_cols = floats.columns.tolist()
    num_missing = floats.isna().sum(axis=0)
   
    for col in float_cols:
         # if float column has missing value, downcast it
        if num_missing[col] > 0:
            chunk[col] = pd.to_numeric(chunk[col], downcast='float') 
        # otherwise, check to see if float columns can be converted to integer
        else:
            is_int = chunk[col].apply(lambda x: x % 1).sum() == 0   # True when every value has zero fractional part
            if is_int:
                # convert and downcast when appropriate
                chunk[col] = chunk[col].astype(int)
                chunk[col] = pd.to_numeric(chunk[col], downcast='integer')

    # compute memory
    chunk_mem = chunk.memory_usage(deep=True).sum()/(1024*1024)
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem,3)} MB")

print(f"total memory: {round(total,3)} MB")
chunk 0: 1.146 MB
chunk 1: 1.13 MB
chunk 2: 1.145 MB
chunk 3: 1.131 MB
chunk 4: 1.129 MB
chunk 5: 1.129 MB
chunk 6: 1.13 MB
chunk 7: 1.118 MB
chunk 8: 1.116 MB
chunk 9: 1.083 MB
chunk 10: 1.092 MB
chunk 11: 1.107 MB
chunk 12: 1.151 MB
chunk 13: 0.372 MB
total memory: 14.979 MB

Conclusion¶

In conclusion, we used pandas to process a dataset in chunks, remaining under an arbitrary memory constraint of 10MB. After properly manipulating and analyzing each chunk, we optimized the memory footprint of the dataframe by:

  • converting columns with many repeated entries to pandas' "category" datatype
  • converting columns containing datetime information to a datetime datatype
  • converting floats to ints when appropriate
  • downcasting each numeric column to an appropriate, space-efficient subtype

Overall, we cut the dataframe's total memory consumption from about 65MB to about 15MB, less than a quarter of the original. Moreover, we cleaned the data and converted it into a form more useful for data analysts.
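
As a natural next step (and the SQL half of this notebook's title), each cleaned chunk could be appended to a SQLite table, so later analysis can be done with SQL queries instead of loading the full file. Here is a minimal sketch, assuming the illustrative database file data/loans.db and table name loans:

In [ ]:
import sqlite3

# Sketch: stream the cleaned chunks into SQLite. Categorical and datetime
# columns are written out as plain values, so the table is query-ready.
conn = sqlite3.connect('data/loans.db')    # illustrative path
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200,
                         dtype=category_cols_dict,
                         parse_dates=dt_cols, date_format="%b-%Y")
for i, chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)
    # ... repeat the row-dropping and numeric cleaning steps from above ...
    chunk.to_sql('loans', conn, if_exists='append', index=False)
conn.close()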