A Data Science Project For DSC 80 At The University of California, San Diego
by So Hirota (hirotaso92602@gmail.com)
Published 3/20/2023
Part 1 (EDA) can be found here!
In Part 1 of this project, I performed EDA along with some permutation and hypothesis testing on the recipes and interactions dataset. In this part of the project, I will further explore the data and build a machine learning model to answer the question: can we predict the average rating of a recipe?

The data comes in two .csv files: RAW_recipes.csv and RAW_interactions.csv.
RAW_recipes.csv contains 83782 rows and 12 columns. The rows represent recipes, and the columns are `name`, `id`, `minutes`, `contributor_id`, `submitted`, `tags`, `nutrition`, `n_steps`, `steps`, `description`, `ingredients`, and `n_ingredients`. The `nutrition` values are in "Percentage Daily Value (PDV)", except for calories (#), which is in kilocalories.
column name | meaning
---|---
`name` | the name of the recipe
`id` | the id of the recipe
`minutes` | the time it takes to make the recipe
`contributor_id` | the id of the recipe contributor
`submitted` | the date the recipe was submitted, in YYYY-MM-DD format
`tags` | the tags associated with the recipe
`nutrition` | nutritional information, in order of calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)
`n_steps` | the number of steps the recipe requires
`steps` | the descriptions of each step
`description` | the description of the recipe
`ingredients` | the ingredients of the recipe
`n_ingredients` | the number of ingredients required to make the recipe
RAW_interactions.csv contains 731927 rows and 5 columns. Each row represents an individual review of a recipe, and the columns are `user_id`, `recipe_id`, `date`, `rating`, and `review`.
column name | meaning
---|---
`user_id` | the user id of the user who posted the review
`recipe_id` | the recipe id for the review, same as the ones in RAW_recipes.csv
`date` | the date the review was posted
`rating` | the star rating of the recipe, from 1 to 5
`review` | the text review of the recipe
For data cleaning, I dropped the `name`, `contributor_id`, `steps`, `submitted`, and `description` columns, since they aren't useful for predicting ratings. I'm keeping `tags` because maybe I could use them for feature engineering down the line, and I'm keeping the `review` column for the same reason. Much of the code used for this was from Part 1 of the project, which I have linked at the top.
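For concreteness, here is a minimal sketch of those cleaning steps. It is an approximation of what Part 1 does: Part 1 also filters recipes (which is why the cleaned dataframe has 9707 rows rather than 83782), and that filtering is omitted here.

```python
import pandas as pd

recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Drop the columns that won't be used for prediction.
recipes = recipes.drop(columns=['name', 'contributor_id', 'steps',
                                'submitted', 'description'])

# Expand the stringified nutrition list into individual numeric columns.
nutrition_cols = ['calories (#)', 'total fat (%)', 'sugar (%)', 'sodium (%)',
                  'protein (%)', 'sat fats (%)', 'carbs (%)']
nutrition = pd.DataFrame(recipes['nutrition'].apply(eval).tolist(),
                         columns=nutrition_cols, index=recipes.index)
recipes = pd.concat([recipes.drop(columns='nutrition'), nutrition], axis=1)

# Attach each recipe's average rating and its list of review texts.
per_recipe = interactions.groupby('recipe_id').agg(
    rating=('rating', 'mean'), review=('review', list))
recipes = recipes.merge(per_recipe, left_on='id', right_index=True)
```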
After data cleaning, the two dataframes look like this.
```python
>>> recipes.shape
(9707, 13)
>>> recipes.head(2)
```

id | minutes | tags | n_steps | n_ingredients | calories (#) | total fat (%) | sugar (%) | sodium (%) | protein (%) | sat fats (%) | carbs (%) | rating | review
---|---|---|---|---|---|---|---|---|---|---|---|---|---
333281 | 75 | ['time-to-make', 'cours | 6 | 9 | 1582.6 | 88.0 | 402.0 | 27.0 | 96.0 | 156.0 | 73.0 | 4.400000 | [Loved it and will make
453467 | 5 | ['15-minutes-or-less', | 2 | 11 | 94.7 | 0.0 | 70.0 | 0.0 | 2.0 | 0.0 | 7.0 | 4.800000 | [Love the anise and orange
```python
>>> interactions.shape
(375987, 3)
>>> interactions.head(3)
```

recipe_id | rating | review
---|---|---
79222 | 4 | Oh, how wonderful! I doubled the crab, and added some
79222 | 5 | Along with the onions we added in a square of salt pork,
79222 | 4 | I made this last nite and it was pretty good. I will
The baseline model is a Linear Regression model, described below. In order to test the model's capacity to generalize to unseen data, I performed a train-test split and trained the model only on the training data.
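As a sketch of the setup (the exact feature list is an assumption on my part; I'm using the numeric columns of the cleaned dataframe):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed feature set: the numeric columns of the cleaned recipes dataframe.
features = ['minutes', 'n_steps', 'n_ingredients', 'calories (#)',
            'total fat (%)', 'sugar (%)', 'sodium (%)', 'protein (%)',
            'sat fats (%)', 'carbs (%)']
X = recipes[features]
y = recipes['rating']

# Hold out a test set so the model is scored on unseen recipes.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
```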
```python
>>> model.score(X_test, y_test)
0.0038402175285358053
```
The R^2 score is consistently close to 0. This is a very poor model: with an R^2 this close to 0, the model explains essentially none of the variance in the data. I think the R^2 score is so low because the features fed into the model are not well correlated with the rating. For the final model, I will transform columns to create those relationships.
Of the nutrition columns, I kept only calories, sodium, and sugar. The other nutrition columns were highly correlated with other columns, so dropping them helps prevent multicollinearity. While this change may not necessarily improve the model's R^2 score, it reduces the dimensionality and makes the model less complex; at the very least, it is unlikely to hurt the R^2. I chose `['minutes', 'n_steps', 'n_ingredients']` for standardization because these columns have the most outliers compared to the other columns, so I think standardizing them would be the most beneficial. The grid searches returned the following best hyperparameters: `{'n_neighbors': 64, 'p': 1}` for the k-nearest-neighbors candidate model, and `{'criterion': 'friedman_mse', 'max_depth': 5, 'min_samples_split': 50}` for the decision tree.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeRegressor

col_trans = ColumnTransformer(
    transformers=[
        # Turn each recipe's review text into a numeric sentiment feature.
        ('split_sentiment', FunctionTransformer(split_sentiment), ['review']),
        # Take the fourth root of minutes to tame its heavy right tail.
        ('root_4', FunctionTransformer(func=reduce, kw_args={'n': 0.25}), ['minutes']),
    ],
    remainder='passthrough',
)

final_model = Pipeline(
    [
        ('transformer', col_trans),
        ('dec_tree', DecisionTreeRegressor(criterion='friedman_mse',
                                           max_depth=5, min_samples_split=50)),
    ]
)
```
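The pipeline references two custom helpers, `split_sentiment` and `reduce`, whose definitions aren't shown in this post. Here is a minimal sketch of what they plausibly look like, assuming `split_sentiment` scores review text with NLTK's VADER analyzer and `reduce` raises values to a fractional power (so `n=0.25` is a fourth root); the details are reconstructions, not the exact code:

```python
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')

_sia = SentimentIntensityAnalyzer()

def split_sentiment(reviews):
    # 'reviews' arrives as a one-column dataframe holding each recipe's list
    # of review texts; score the concatenated text with VADER's compound
    # polarity to get one sentiment number per recipe.
    text = reviews['review'].apply(lambda revs: ' '.join(map(str, revs)))
    return text.apply(lambda t: _sia.polarity_scores(t)['compound']).to_frame()

def reduce(X, n):
    # Raise every value to the n-th power; n=0.25 takes the fourth root.
    return np.power(X, n)
```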
```python
>>> final_model.fit(X_train, y_train)
>>> final_model.score(X_test, y_test)
0.5373138645841257
```
For the final model, I used a Decision Tree Regressor with a max depth of 5, a minimum sample split size of 50, and the Friedman MSE criterion. I performed a GridSearch that took over 10 minutes to find these optimal parameters. This model represents a leap forward from the baseline model, using the R^2 score as this project's measure of how "good" a model is. The baseline Linear Regression model was only able to explain 0.3% of the total variance in the response variable, whilst this model can explain about 54% of the variance. In practice, this means the Decision Tree Regressor is far better at making predictions than the Linear Regression model.
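As a sketch of how that grid search can be set up: the parameter grid below is an assumption built around the winning hyperparameters, not the exact grid I searched, and the train split is redone here with the final feature set (also an assumption) since the pipeline expects the `review` column.

```python
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed final feature set: review text plus the kept numeric columns.
final_features = ['review', 'minutes', 'n_steps', 'n_ingredients',
                  'calories (#)', 'sodium (%)', 'sugar (%)']
X_train, X_test, y_train, y_test = train_test_split(
    recipes[final_features], recipes['rating'], random_state=42)

# Hypothetical grid centered on the winning hyperparameters.
param_grid = {
    'dec_tree__criterion': ['squared_error', 'friedman_mse'],
    'dec_tree__max_depth': [2, 5, 10, 25, 50],
    'dec_tree__min_samples_split': [2, 10, 50, 100],
}

search = GridSearchCV(final_model, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # matches the best parameters reported above
```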
I will perform a permutation test to determine if the model can predict the average ratings of recipes published before 2009 as well as it predicts the average ratings of recipes published after 2009.

Null Hypothesis: Our model is fair. Its R^2 scores for recipes published before and after 2009 are roughly the same, and any difference is due to random chance.

Alternative Hypothesis: Our model is NOT fair. Its R^2 score is higher for recipes published after 2009 than for recipes published before 2009.
Parameters: the test statistic is the difference in R^2 scores between the two groups of recipes.
The dataframe that I will perform the permutation test on looks like this.
```python
>>> permutation_df.head(5)
```

rating | predicted | before_2009
---|---|---
4.4 | 4.86217 | True
4.8 | 4.88197 | False
4.81818 | 4.55379 | True
5 | 4.36744 | True
4 | 4.61241 | True
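A sketch of the permutation procedure, assuming the test statistic is the post-2009 R^2 minus the pre-2009 R^2 and that the `before_2009` labels are shuffled to simulate the null (the iteration count here is my choice):

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_diff(df):
    # Post-2009 R^2 minus pre-2009 R^2.
    after = df[~df['before_2009']]
    before = df[df['before_2009']]
    return (r2_score(after['rating'], after['predicted'])
            - r2_score(before['rating'], before['predicted']))

observed = r2_diff(permutation_df)

# Shuffle the group labels to build the null distribution.
diffs = []
for _ in range(1000):
    shuffled = permutation_df.assign(
        before_2009=np.random.permutation(permutation_df['before_2009']))
    diffs.append(r2_diff(shuffled))

p_value = np.mean(np.array(diffs) >= observed)
```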
With a p-value of 0.0, I reject the null hypothesis. It is likely that our model is unfair: it has a higher R^2 score when predicting the ratings of recipes submitted after 2009 than for recipes published before 2009.