Sometimes, data engineers have to handle datasets that are too large to fit into memory at once. In that case, the workflow for cleaning and analyzing the data becomes slightly more complicated. In this notebook, we'll demonstrate how this can be done using chunk processing in pandas. Throughout the notebook, we'll assume an artificial memory limit of 10MB to simulate working under memory constraints.
After reading in the data in chunks of 3,200 rows (each of which consumes about 5MB, or roughly 50% of our memory limit), we were able to:

- measure the memory footprint of the full dataset chunk by chunk;
- identify and drop the erroneous trailing rows and the uninformative id column;
- convert low-cardinality string columns to the category datatype;
- parse the date columns as datetimes and convert percentage/term strings to numeric columns;
- downcast the numeric columns to more space-efficient subtypes.
After completing the above steps, we decreased the total memory footprint of the data from about 65MB to roughly 15MB! This optimization allows the data analysis process to go much more smoothly, since calculations done with only a few selected columns will comfortably fit within the artificial memory constraint.
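As a quick sketch of the kind of selected-column calculation this enables (the choice of columns and the aggregation are purely illustrative; the file path matches the one used throughout the notebook):

import pandas as pd

# Sum loan amounts by loan status, reading only two columns, 3,200 rows at a time,
# so only one small chunk is ever held in memory.
totals = {}
for chunk in pd.read_csv('data/loans_2007.csv', chunksize=3200,
                         usecols=['loan_status', 'loan_amnt']):
    chunk = chunk.dropna(subset=['loan_status'])  # skips the malformed trailing rows
    for status, amount in chunk.groupby('loan_status')['loan_amnt'].sum().items():
        totals[status] = totals.get(status, 0) + amount
pd.Series(totals).sort_values(ascending=False)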
The dataset used throughout this notebook can be found here. This dataset comes from Lending Club, which is a service that connects those seeking personal loans with potential investors. Since the focus of this notebook is demonstrating a workflow using pandas that scales well with large amounts of data, we omit a data dictionary for this project.
To begin, we'll read in the first five rows just to get a peek at the dataset.
import pandas as pd
pd.options.display.max_columns = 99  # lets us see all the columns
preview = pd.read_csv('data/loans_2007.csv', nrows = 5)
preview
id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | NaN | 10+ years | RENT | 24000.0 | Verified | Dec-2011 | Fully Paid | n | credit_card | Computer | 860xx | AZ | 27.65 | 0.0 | Jan-1985 | 1.0 | 3.0 | 0.0 | 13648.0 | 83.7% | 9.0 | f | 0.00 | 0.00 | 5863.155187 | 5833.84 | 5000.00 | 863.16 | 0.00 | 0.00 | 0.00 | Jan-2015 | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | Ryder | < 1 year | RENT | 30000.0 | Source Verified | Dec-2011 | Charged Off | n | car | bike | 309xx | GA | 1.00 | 0.0 | Apr-1999 | 5.0 | 3.0 | 0.0 | 1687.0 | 9.4% | 4.0 | f | 0.00 | 0.00 | 1008.710000 | 1008.71 | 456.46 | 435.17 | 0.00 | 117.08 | 1.11 | Apr-2013 | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | NaN | 10+ years | RENT | 12252.0 | Not Verified | Dec-2011 | Fully Paid | n | small_business | real estate business | 606xx | IL | 8.72 | 0.0 | Nov-2001 | 2.0 | 2.0 | 0.0 | 2956.0 | 98.5% | 10.0 | f | 0.00 | 0.00 | 3005.666844 | 3005.67 | 2400.00 | 605.67 | 0.00 | 0.00 | 0.00 | Jun-2014 | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1076863 | 1277178.0 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | AIR RESOURCES BOARD | 10+ years | RENT | 49200.0 | Source Verified | Dec-2011 | Fully Paid | n | other | personel | 917xx | CA | 20.00 | 0.0 | Feb-1996 | 1.0 | 10.0 | 0.0 | 5598.0 | 21% | 37.0 | f | 0.00 | 0.00 | 12231.890000 | 12231.89 | 10000.00 | 2214.92 | 16.97 | 0.00 | 0.00 | Jan-2015 | 357.48 | Apr-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1075358 | 1311748.0 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | University Medical Group | 1 year | RENT | 80000.0 | Source Verified | Dec-2011 | Current | n | other | Personal | 972xx | OR | 17.94 | 0.0 | Jan-1996 | 0.0 | 15.0 | 0.0 | 27783.0 | 53.9% | 38.0 | f | 461.73 | 461.73 | 3581.120000 | 3581.12 | 2538.27 | 1042.85 | 0.00 | 0.00 | 0.00 | Jun-2016 | 67.79 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
We have a few observations right off the bat:

- The dataset has a large number of columns (more than 50), many of which we may not need for analysis.
- Several columns store numbers as strings: int_rate and revol_util contain percent signs, and term embeds the word "months".
- Date columns such as issue_d and earliest_cr_line are stored as text in a month-year format.
- Many string columns (e.g., grade, home_ownership, application_type) appear to take only a handful of distinct values.
Let's use chunk processing to compute the memory footprint of the entire dataset. Throughout, we take 1 megabyte (MB) to equal $1024^2$ bytes.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
total = 0
for i, chunk in enumerate(chunk_iter):
    chunk_mem = chunk.memory_usage(deep=True).sum() / (1024 * 1024)  # convert bytes to MB
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem, 3)} MB")
print(f"total memory: {round(total, 3)} MB")
chunk 0: 4.886 MB chunk 1: 4.881 MB chunk 2: 4.884 MB chunk 3: 4.883 MB chunk 4: 4.881 MB chunk 5: 4.881 MB chunk 6: 4.882 MB chunk 7: 4.883 MB chunk 8: 4.881 MB chunk 9: 4.891 MB chunk 10: 4.891 MB chunk 11: 4.903 MB chunk 12: 5.128 MB chunk 13: 1.518 MB total memory: 65.274 MB
We see that a chunksize of 3,200 rows keeps every chunk's memory footprint at around 50% of our 10MB limit, so we'll use this chunksize for the remainder of the notebook.
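As an aside, here's a rough sketch of how a chunksize like this could be estimated up front from a small sample (the 1,000-row sample and the 50% target are our own choices, not anything pandas prescribes):

# Estimate a chunksize: measure average bytes per row on a small sample,
# then size chunks to hit a target fraction of the memory budget.
sample = pd.read_csv('data/loans_2007.csv', nrows=1000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
memory_limit = 10 * 1024 * 1024   # our artificial 10MB limit
target_fraction = 0.5             # aim for chunks around half the limit
estimated_chunksize = int(memory_limit * target_fraction / bytes_per_row)
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize of roughly {estimated_chunksize} rows")

With the chunksize settled, let's check which datatypes are present throughout the chunks.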
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
dtypes = [chunk.dtypes.value_counts().index.tolist() for chunk in chunk_iter]
dtypes
[[dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O'), dtype('int64')], [dtype('float64'), dtype('O')], [dtype('float64'), dtype('O')]]
We see that the datatypes in the last two chunks differ from those in the other chunks. Let's take a closer look by counting how many columns of each type there are per chunk.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
strings = []
floats = []
ints = []
for chunk in chunk_iter:
    # .shape[1] gives the number of columns of each dtype in the chunk
    strings.append(chunk.select_dtypes(include='O').shape[1])
    floats.append(chunk.select_dtypes(include='float64').shape[1])
    ints.append(chunk.select_dtypes(include='int64').shape[1])
print(f"Object cols: {strings}")
print(f"Float cols: {floats}")
print(f"Int cols: {ints}")
Object cols: [21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22] Float cols: [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30] Int cols: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
We see that in the last two chunks, one of the integer columns is being interpreted as an object column. Let's see which column this is.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
for i, chunk in enumerate(chunk_iter):
    if i == 1:
        initial_obj_set = set(chunk.select_dtypes(include='O').columns.tolist())  # object columns for a typical chunk
    elif i == 13:
        final_obj_set = set(chunk.select_dtypes(include='O').columns.tolist())    # object columns for the final chunk
col = final_obj_set.difference(initial_obj_set)  # the extra object column
col
{'id'}
Interestingly, the id column is being read in as an object in the final two chunks. Looking at the tail of the last chunk reveals why this is occurring.
chunk.tail(3)
id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42535 | 70686 | 70681.0 | 5000.0 | 5000.0 | 0.0 | 36 months | 7.75% | 156.11 | A | A3 | Homemaker | 10+ years | MORTGAGE | 70000.0 | Not Verified | Jun-2007 | Does not meet the credit policy. Status:Fully ... | n | other | Aroundthehouse | 068xx | CT | 8.81 | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | f | 0.0 | 0.0 | 5619.76209 | 0.0 | 5000.0 | 619.76 | 0.0 | 0.0 | 0.0 | Jun-2010 | 156.39 | Feb-2015 | NaN | 1.0 | INDIVIDUAL | NaN | NaN | NaN | NaN | NaN |
42536 | Total amount funded in policy code 1: 471701350 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
42537 | Total amount funded in policy code 2: 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We see that the last two rows are erroneous: they are summary lines rather than loan records and do not follow the file format. We can avoid the dtype issue by dropping the last two rows of the final chunk. In some cases, we may want to make sure this information is still documented (a sketch of one way to do so follows below), but that step is not necessary for this notebook.
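If we did want to keep a record of these summary lines, a minimal sketch might look like the following (the dropped_rows_log list and the helper function are our own convention, not part of the workflow below):

dropped_rows_log = []  # keep the dropped summary lines around for documentation

def drop_trailing_summary_rows(chunk, n=2):
    """Remove the last n rows of a chunk, logging their id text first."""
    tail = chunk.tail(n)
    dropped_rows_log.extend(tail['id'].tolist())
    return chunk.drop(tail.index)

# e.g. applied to the final chunk:
# chunk = drop_trailing_summary_rows(chunk)
# dropped_rows_log now holds the "Total amount funded ..." summary strings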
Looking at the number of integer columns in the second-to-last chunk, it seems that a similar phenomenon is occurring there. Let's take a closer look.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
for i, chunk in enumerate(chunk_iter):
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True)  # drop erroneous rows in the final chunk
        chunk["id"] = chunk["id"].astype('int64')      # fix the dtype
        obj_set = set(chunk.select_dtypes(include='O'))
    elif i == 12:
        second_to_last = chunk
        obj_set_wrong = set(chunk.select_dtypes(include='O'))
col = obj_set_wrong.difference(obj_set)  # the extra object column
col
{'id'}
We suspect from the previous analysis that a single row, whose id is a string rather than a number, is erroneous. Let's find it with a regex by selecting rows whose id contains no digits.
second_to_last.loc[~second_to_last["id"].str.contains('[0-9]+'), :]  # rows whose id contains no digits
id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39786 | Loans that do not meet the credit policy | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
This row indicates that the rows below it do not meet the credit policy. We could create a separate column to keep track of this, but it isn't needed in our case. In fact, since the id column doesn't contain information relevant to our analysis, we can fix the dtype problems in both of the last two chunks by simply dropping the id column from every chunk, which we do throughout the remainder of the notebook.
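As an alternative (a sketch only, not what we do in the rest of the notebook), the id column could be excluded at parse time instead of dropped afterwards by passing a callable to usecols:

# Skip the id column while reading, so it never takes up memory at all.
no_id_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200,
                         usecols=lambda name: name != 'id')
first = next(no_id_iter)
'id' in first.columns  # False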
Now, let's determine which string columns mostly contain repeated values. We'll compute the percentage of unique values in each column and flag those where fewer than 40% of the values are unique. These columns will be candidates for conversion to a categorical type.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200)
total_repeated = []  # for each chunk, the list of string columns with few unique values
total_num = []       # for each chunk, the number of unique values in each string column
for i, chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)  # drop the id column
    # drop the erroneous rows in the last and second-to-last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True)
    elif i == 12:
        droprow = 39786
        chunk.drop(labels=droprow, axis=0, inplace=True)
    chunk_repeated = []  # string columns with few unique values in this chunk
    chunk_num = []       # number of unique values per string column in this chunk
    # columns with string (object) dtype
    str_cols = chunk.select_dtypes(include='O').columns.tolist()
    for col in str_cols:
        num_unique = chunk[col].value_counts().shape[0]  # number of unique values
        chunk_num.append(num_unique)
        percent_unique = (num_unique / chunk.shape[0]) * 100  # as a percentage of rows
        if percent_unique < 40:
            chunk_repeated.append(col)  # flag the column if its values repeat heavily
    total_repeated.append(chunk_repeated)
    total_num.append(chunk_num)
for item in total_repeated:
    print(item)
    print(len(item))
['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 
'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 19 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'addr_state', 'earliest_cr_line', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 17
Since the final chunk is much shorter, it flags fewer columns as categorical candidates. Overall, however, the 19 columns flagged in every other chunk are good candidates for the category datatype.
We'll convert to the category datatype the columns that are flagged in every chunk except the last.
category_candidates = set(total_repeated[0])
for item in total_repeated[0:-1]:
    # keep only the columns flagged in every chunk (except the last)
    category_candidates = category_candidates & set(item)
category_candidates
{'addr_state', 'application_type', 'earliest_cr_line', 'emp_length', 'grade', 'home_ownership', 'initial_list_status', 'int_rate', 'issue_d', 'last_credit_pull_d', 'last_pymnt_d', 'loan_status', 'purpose', 'pymnt_plan', 'revol_util', 'sub_grade', 'term', 'verification_status', 'zip_code'}
Let's also look at the number of unique values in each string column, chunk by chunk.
df = pd.DataFrame(columns=str_cols, data=total_num)
df
term | int_rate | grade | sub_grade | emp_title | emp_length | home_ownership | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | earliest_cr_line | revol_util | initial_list_status | last_pymnt_d | last_credit_pull_d | application_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 36 | 7 | 35 | 2820 | 11 | 3 | 3 | 2 | 6 | 1 | 13 | 1487 | 581 | 43 | 373 | 902 | 1 | 55 | 55 | 1 |
1 | 2 | 37 | 7 | 35 | 2754 | 11 | 3 | 3 | 4 | 6 | 1 | 13 | 1532 | 576 | 45 | 404 | 960 | 1 | 56 | 57 | 1 |
2 | 2 | 67 | 7 | 34 | 2759 | 11 | 3 | 3 | 3 | 5 | 1 | 13 | 1542 | 581 | 43 | 404 | 966 | 1 | 58 | 59 | 1 |
3 | 2 | 37 | 7 | 35 | 2747 | 11 | 3 | 3 | 3 | 6 | 1 | 13 | 1651 | 580 | 43 | 386 | 949 | 1 | 59 | 60 | 1 |
4 | 2 | 86 | 7 | 35 | 2773 | 11 | 4 | 3 | 4 | 6 | 1 | 13 | 1700 | 578 | 43 | 393 | 972 | 1 | 61 | 61 | 1 |
5 | 2 | 36 | 7 | 35 | 2781 | 11 | 3 | 3 | 4 | 3 | 1 | 13 | 2019 | 579 | 43 | 382 | 957 | 1 | 63 | 63 | 1 |
6 | 2 | 69 | 7 | 35 | 2753 | 11 | 3 | 3 | 4 | 3 | 1 | 13 | 2079 | 583 | 43 | 390 | 954 | 1 | 65 | 66 | 1 |
7 | 2 | 70 | 7 | 35 | 2786 | 11 | 3 | 3 | 4 | 3 | 1 | 14 | 2059 | 565 | 43 | 375 | 949 | 1 | 64 | 69 | 1 |
8 | 2 | 36 | 7 | 35 | 2833 | 11 | 3 | 3 | 5 | 3 | 2 | 14 | 2110 | 551 | 41 | 371 | 965 | 1 | 67 | 72 | 1 |
9 | 2 | 50 | 7 | 34 | 2833 | 11 | 3 | 3 | 6 | 2 | 1 | 14 | 2501 | 534 | 42 | 369 | 961 | 1 | 64 | 75 | 1 |
10 | 1 | 65 | 7 | 35 | 2696 | 11 | 4 | 3 | 7 | 2 | 1 | 14 | 2554 | 542 | 42 | 382 | 963 | 1 | 52 | 81 | 1 |
11 | 1 | 86 | 7 | 35 | 2805 | 11 | 4 | 3 | 13 | 2 | 1 | 14 | 2406 | 577 | 43 | 371 | 947 | 1 | 53 | 91 | 1 |
12 | 2 | 264 | 7 | 35 | 2757 | 11 | 5 | 3 | 43 | 4 | 1 | 14 | 2431 | 615 | 50 | 387 | 966 | 1 | 96 | 107 | 1 |
13 | 1 | 97 | 7 | 35 | 833 | 11 | 5 | 2 | 13 | 2 | 1 | 13 | 793 | 377 | 48 | 275 | 580 | 1 | 48 | 97 | 1 |
Keep in mind that the following columns are string types which will NOT be converted into a categorical type:
set(df.columns).difference(category_candidates)
{'emp_title', 'title'}
Based on their names and the counts above, these columns hold free-form descriptive text (employment and loan titles) with thousands of distinct values per chunk, so they should not be stored categorically.
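To see why, here is a small synthetic illustration (the data below is made up; only the general pattern matters): the category dtype pays off when values repeat heavily, and much less when almost every value is distinct.

# Synthetic comparison: low- vs high-cardinality string columns as object vs category.
s_low = pd.Series(["RENT", "OWN", "MORTGAGE"] * 10_000)        # few distinct values
s_high = pd.Series([f"job title {i}" for i in range(30_000)])  # mostly distinct values
for name, s in [("low cardinality", s_low), ("high cardinality", s_high)]:
    as_obj = s.memory_usage(deep=True) / 1024
    as_cat = s.astype('category').memory_usage(deep=True) / 1024
    print(f"{name}: object {as_obj:.0f} KB vs category {as_cat:.0f} KB")

Next, let's look at the columns that store dates.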
dt_cols = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
preview[dt_cols]
issue_d | earliest_cr_line | last_pymnt_d | last_credit_pull_d | |
---|---|---|---|---|
0 | Dec-2011 | Jan-1985 | Jan-2015 | Jun-2016 |
1 | Dec-2011 | Apr-1999 | Apr-2013 | Sep-2013 |
2 | Dec-2011 | Nov-2001 | Jun-2014 | Jun-2016 |
3 | Dec-2011 | Feb-1996 | Jan-2015 | Apr-2016 |
4 | Dec-2011 | Jan-1996 | Jun-2016 | Jun-2016 |
We see that each of these columns stores dates in the "%b-%Y" format (in datetime format codes). We can use the parse_dates and date_format arguments of pd.read_csv() to have these columns parsed as datetimes automatically.
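As a quick sanity check (on the small preview frame only), we can confirm that this format string parses the columns as expected:

# Parse the preview's date columns with the "%b-%Y" format.
parsed = preview[dt_cols].apply(pd.to_datetime, format="%b-%Y")
parsed.dtypes  # each column becomes datetime64[ns]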
A number of string columns can be converted into a numeric type. We'll list these columns and remove them from the set of candidates for the categorical datatype.
str_to_numeric_cols = ["term","int_rate","revol_util"]
category_cols = category_candidates.difference(dt_cols) # remove datetime columns from category columns
category_cols = category_cols.difference(str_to_numeric_cols) # remove columns we will convert to numeric
category_cols = list(category_cols)
Now we'll create a dictionary telling pandas which columns should be read in as categorical.
category_cols_dict = {}
for col in category_cols:
    category_cols_dict[col] = 'category'  # dictionary to pass to pd.read_csv(dtype=...)
Let's check the memory consumption after making these changes.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200,
                         dtype=category_cols_dict,
                         parse_dates=dt_cols, date_format="%b-%Y")
total = 0
for i, chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)
    # drop the erroneous rows in the last and second-to-last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True)
    elif i == 12:
        droprow = 39786
        chunk.drop(labels=droprow, axis=0, inplace=True)
    # remove "months" from term and convert to integer
    term_clean = chunk["term"].str.strip().str.replace("months", "").astype(int)
    chunk["term"] = term_clean
    # remove "%" and convert to float
    int_rate_clean = chunk["int_rate"].str.strip().str.replace("%", "").astype(float)
    chunk["int_rate"] = int_rate_clean
    revol_util_clean = chunk["revol_util"].str.strip().str.replace("%", "").astype(float)
    chunk["revol_util"] = revol_util_clean
    # compute memory usage
    chunk_mem = chunk.memory_usage(deep=True).sum() / (1024 * 1024)
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem, 3)} MB")
print(f"total memory: {round(total, 3)} MB")
chunk 0: 1.451 MB chunk 1: 1.448 MB chunk 2: 1.45 MB chunk 3: 1.449 MB chunk 4: 1.447 MB chunk 5: 1.446 MB chunk 6: 1.448 MB chunk 7: 1.447 MB chunk 8: 1.446 MB chunk 9: 1.455 MB chunk 10: 1.455 MB chunk 11: 1.47 MB chunk 12: 1.486 MB chunk 13: 0.452 MB total memory: 19.349 MB
That's already a massive decrease in the dataframe's memory consumption. The final step is to downcast the numeric columns to the most space-efficient subtypes.
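For reference, here is what pd.to_numeric's downcast option does on a tiny made-up series (the values are chosen arbitrarily for illustration):

# float64 values that fit in float32 get downcast; whole numbers can go further as integers.
s = pd.Series([5000.0, 2500.0, 10000.0])
print(s.dtype)                                                  # float64
print(pd.to_numeric(s, downcast='float').dtype)                 # float32
print(pd.to_numeric(s.astype(int), downcast='integer').dtype)   # int16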
We'll complete our workflow by applying all of the previous changes, in addition to the necessary numeric conversions. The conversions for the numeric columns are as follows:

- term: strip the "months" suffix and convert to an integer;
- int_rate and revol_util: strip the "%" sign and convert to floats;
- float columns with missing values: downcast to a smaller float subtype;
- float columns without missing values whose entries are all whole numbers: convert to integers and downcast to the smallest integer subtype.
This is done in the following code cell.
chunk_iter = pd.read_csv('data/loans_2007.csv', chunksize=3200,
                         dtype=category_cols_dict,
                         parse_dates=dt_cols, date_format="%b-%Y")
total = 0
for i, chunk in enumerate(chunk_iter):
    chunk.drop(columns=['id'], inplace=True)
    # drop the erroneous rows in the last and second-to-last chunks
    if i == 13:
        chunk.drop(chunk.tail(2).index, inplace=True)
    elif i == 12:
        droprow = 39786
        chunk.drop(labels=droprow, axis=0, inplace=True)
    # clean string columns and convert them to numeric
    term_clean = chunk["term"].str.strip().str.replace("months", "").astype(int)
    chunk["term"] = term_clean
    int_rate_clean = chunk["int_rate"].str.strip().str.replace("%", "").astype(float)
    chunk["int_rate"] = int_rate_clean
    revol_util_clean = chunk["revol_util"].str.strip().str.replace("%", "").astype(float)
    chunk["revol_util"] = revol_util_clean
    floats = chunk.select_dtypes(include='float')
    float_cols = floats.columns.tolist()
    num_missing = floats.isna().sum(axis=0)
    for col in float_cols:
        # if the float column has missing values, downcast it to a smaller float subtype
        if num_missing[col] > 0:
            chunk[col] = pd.to_numeric(chunk[col], downcast='float')
        # otherwise, check whether the column holds only whole numbers
        else:
            is_int = chunk[col].apply(lambda x: x % 1).sum() == 0
            if is_int:
                # convert to integer and downcast to the smallest integer subtype
                chunk[col] = chunk[col].astype(int)
                chunk[col] = pd.to_numeric(chunk[col], downcast='integer')
    # compute memory usage
    chunk_mem = chunk.memory_usage(deep=True).sum() / (1024 * 1024)
    total += chunk_mem
    print(f"chunk {i}: {round(chunk_mem, 3)} MB")
print(f"total memory: {round(total, 3)} MB")
chunk 0: 1.146 MB chunk 1: 1.13 MB chunk 2: 1.145 MB chunk 3: 1.131 MB chunk 4: 1.129 MB chunk 5: 1.129 MB chunk 6: 1.13 MB chunk 7: 1.118 MB chunk 8: 1.116 MB chunk 9: 1.083 MB chunk 10: 1.092 MB chunk 11: 1.107 MB chunk 12: 1.151 MB chunk 13: 0.372 MB total memory: 14.979 MB
In conclusion, we used pandas to process a dataset in chunks, staying under an arbitrary memory constraint of 10MB. After cleaning and analyzing each chunk, we optimized the memory footprint of the dataframe by:

- dropping the uninformative id column and the erroneous summary rows;
- converting low-cardinality string columns to the category datatype;
- parsing the date columns as datetimes;
- converting term, int_rate, and revol_util from strings to numeric columns;
- downcasting the numeric columns to smaller subtypes.
Overall, we cut the total memory consumption of the data from about 65MB to roughly 15MB, less than a quarter of the original. Moreover, we cleaned the data and converted it into a form that is much more useful for data analysts.