A Data Science Project For DSC80 At The University of California, San Diego
by So Hirota (hirotaso92602@gmail.com)
Published 2/24/2023
I will analyze the provided dataset using pandas, NumPy, Plotly, hypothesis testing, and permutation testing.
The recipes dataset contains two .csv files: the RAW_recipes and the RAW_interactions datasets.

`RAW_recipes.csv` contains 83782 rows and 12 columns. The rows represent the recipes, and the columns are `name`, `id`, `minutes`, `contributor_id`, `submitted`, `tags`, `nutrition`, `n_steps`, `steps`, `description`, `ingredients`, and `n_ingredients`. The values in `nutrition` are in “Percentage Daily Value (PDV)”, except for `calories (#)`, which is in kilocalories.

column name | meaning |
---|---|
name | the name of the recipe |
id | the id of the recipe |
minutes | the time it takes to make the recipe |
contributor_id | the id of the recipe contributor |
submitted | the date the recipe was submitted, in YYYY-MM-DD format |
tags | the tags associated with the recipe |
nutrition | nutritional information, in order of calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV) |
n_steps | the number of steps the recipe requires |
steps | the descriptions of each step |
description | the description of the recipe |
ingredients | the ingredients of the recipe |
n_ingredients | the number of ingredients required to make the recipe |

`RAW_interactions.csv` contains 731927 rows and 5 columns. Each row represents an individual review of a recipe, and the columns are `user_id`, `recipe_id`, `date`, `rating`, and `review`.

column name | meaning |
---|---|
user_id | user id of the user who posted a review |
recipe_id | recipe id for the review, same as the ones in RAW_recipes.csv |
date | the date that the review was posted |
rating | the star rating of the recipe, from 1 to 5 |
review | the text review of the recipe |

The analysis in this notebook is centered around one question: what kinds of recipes have the highest calories? The columns that may be relevant to the analysis include `minutes`, `nutrition` (this contains the calorie information), `tags`, `n_steps`, `n_ingredients`, `ingredients`, and `rating`.

By investigating this question, a person attempting a diet may be able to avoid high calorie recipes based on the key factors that correlate with high calorie counts.
Both files were loaded with `pd.read_csv()`.
recipes.head(2)
name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
---|---|---|---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | [‘60-minutes-or-less’, | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 10 | [‘heat the oven to 350f | these are the most; | [‘bittersweet chocolate’, | 9 |
1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | [‘60-minutes-or-less’, | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 12 | [‘pre-heat oven the 350 | this is the recipe that | [‘white sugar’, ‘brown | 11 |
interactions.head(5)
user_id | recipe_id | date | rating | review |
---|---|---|---|---|
1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great fo |
126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and to |
57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping… |
124416 | 120345 | 2011-08-06 | 0 | Just an observation, so I will no |
2000192946 | 120345 | 2015-05-10 | 2 | This recipe was OVERLY too sweet. |
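
The two tables above are combined with a left merge, and a per-recipe average rating (the `avg_rating` column used later) can be derived from the result. The snippet below is a minimal sketch on made-up toy rows. It assumes that a rating of 0 is treated as missing before averaging, as described in the missingness section; the exact steps used in the project may differ.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real files (values are made up for illustration).
recipes = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 2, 3],
    "rating": [5, 0, 4, 0],
})

# A left merge keeps recipes that have no reviews at all (recipe 4 here).
merged = recipes.merge(interactions, how="left", left_on="id", right_on="recipe_id")

# Assumption: a rating of 0 means "no star rating given", so treat it as missing.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Mean rating per recipe; missing ratings are skipped by mean().
avg_rating = merged.groupby("id")["rating"].mean()

# Attach the per-recipe average back onto the recipes table.
with_avg = recipes.assign(avg_rating=recipes["id"].map(avg_rating))
print(with_avg)
```

In this toy example, recipes 3 and 4 end up with a missing `avg_rating`: one has only a 0 rating, the other has no reviews.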
merged = recipes.merge(interactions, how = 'left', left_on = 'id', right_on = 'recipe_id')

- The `"submitted"` column was a string, so I changed it to a datetime object using `pd.to_datetime()`.
- The `"tags"` column was a string that looked like a list. I converted it into an actual list with `.transform()`, then used `MultiLabelBinarizer()` from the `sklearn.preprocessing` module to perform one-hot encoding.

id | 3-steps-or-less | 30-minutes-or-less | 4-hours-or-less | 5-ingredients-or-less |
---|---|---|---|---|
286009 | 0 | 0 | 1 | 0 |
475785 | 0 | 0 | 1 | 0 |
500166 | 1 | 1 | 0 | 0 |
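
The one-hot encoding of tags described above can be sketched as follows. This is a pandas-only stand-in for `MultiLabelBinarizer()`, not the project's own code. The two toy rows reuse ids from the table above, but their tag lists are shortened for illustration.

```python
import ast
import pandas as pd

# Toy rows: in the raw file, each tags value is a string that looks like a list.
recipes = pd.DataFrame({
    "id": [286009, 500166],
    "tags": [
        "['4-hours-or-less']",
        "['3-steps-or-less', '30-minutes-or-less']",
    ],
})

# Parse each string into a real Python list.
tag_lists = recipes["tags"].apply(ast.literal_eval)

# Join each list with a separator, then expand into one 0/1 column per tag.
one_hot = tag_lists.str.join("|").str.get_dummies(sep="|")
one_hot.insert(0, "id", recipes["id"])
print(one_hot)
```

`ast.literal_eval` is used instead of `eval` so that only literal values are parsed from the strings.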
- The `nutrition` column was split into separate columns: [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), and carbohydrates (PDV)].
- Some columns were dropped, including `'description'` and `'contributor_id'`, because there is likely no use for those columns when trying to answer the question.
- `calories (#)` was converted into a categorical variable by using the `dfcut()` method, which takes in a dataframe and a bin width to “cut” the dataframe’s `calories (#)` column by the given width. This helps later on when I want to graph relationships between calories and another variable, or when I want to calculate TVD for hypothesis / permutation testing.

At the end, the dataframe `post_clean` looks like this.
post_clean.head(3)
name | id | minutes | submitted | n_steps | n_ingredients | avg_rating | calories (#) | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | cal_bins | steps | ingredients |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | 333281 | 40 | 2008-10-27 00:00:00 | 10 | 9 | 4 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 200 | ‘heat the oven to 350f’ | ‘[bittersweet chocolate’ |
1 in canada chocolate chip cookies | 453467 | 45 | 2011-04-11 00:00:00 | 12 | 11 | 5 | 595.1 | 46 | 211 | 22 | 13 | 51 | 26 | 600 | ‘pre-heat oven the 350’ | ‘[white sugar’, ‘brown’ |
412 broccoli casserole | 306168 | 40 | 2008-05-30 00:00:00 | 6 | 9 | 5 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 200 | ‘preheat oven to 350’ | ‘[frozen broccoli cut’ |
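
The nutrition split and the calorie binning described in the cleaning steps can be sketched as below. The helper names `split_nutrition` and `dfcut_sketch` are hypothetical and the original `dfcut()` may be implemented differently. Labelling each bin by its upper edge is an assumption, chosen because it reproduces the `cal_bins` values in the table above.

```python
import ast
import numpy as np
import pandas as pd

NUTRITION_COLS = [
    "calories (#)", "total fat (PDV)", "sugar (PDV)", "sodium (PDV)",
    "protein (PDV)", "saturated fat (PDV)", "carbohydrates (PDV)",
]

def split_nutrition(df):
    """Turn the list-like nutrition string into seven numeric columns."""
    parsed = df["nutrition"].apply(ast.literal_eval)
    expanded = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS, index=df.index)
    return pd.concat([df.drop(columns="nutrition"), expanded], axis=1)

def dfcut_sketch(df, width, col="calories (#)"):
    """Add a cal_bins column; each bin is labelled by its upper edge."""
    out = df.copy()
    top = int(np.ceil(out[col].max() / width)) * width
    edges = np.arange(0, top + width, width)
    # Bins are (0, width], (width, 2*width], ... so a value of exactly 0 gets no bin.
    out["cal_bins"] = pd.cut(out[col], bins=edges, labels=edges[1:])
    return out

# Nutrition strings taken from the three recipes shown in the tables above.
raw = pd.DataFrame({
    "nutrition": [
        "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
        "[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",
        "[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",
    ],
})
binned = dfcut_sketch(split_nutrition(raw), 100)
print(binned[["calories (#)", "cal_bins"]])
```

Because the result is a categorical column, it can be passed directly to `groupby` or `pivot_table` as an index.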
Here, I will inspect the relevant columns of the dataframe individually.
[Figure: distribution of `n_ingredients`]

[Figure: distribution of `calories (#)`]

post_clean.sort_values(by = 'calories (#)').iloc[-10:][['name', 'calories (#)', 'minutes', 'n_steps', 'n_ingredients', 'avg_rating']]
name | calories (#) | minutes | n_steps | n_ingredients | avg_rating |
---|---|---|---|---|---|
granny jones secret salty sweet biscuit recipe | 17551.6 | 150 | 5 | 5 | 3 |
homesteader s fireweed honey | 17554 | 60 | 6 | 6 | 5 |
cracker snack mix | 18268.7 | 50 | 4 | 10 | 5 |
algerian khobz el dar semolina bread | 18656 | 155 | 17 | 11 | 5 |
alternate honey barbecue sauce with riblets applebee s copycat | 21497.8 | 225 | 16 | 12 | 5 |
coffee glazed doughnuts | 22371.2 | 69 | 19 | 15 | 5 |
hocus pocus cottage cake | 26604.4 | 3000 | 81 | 27 | 5 |
ultimate coconut cake ii | 28930.2 | 120 | 53 | 16 | 5 |
moonshine easy | 36188.8 | 7200 | 27 | 4 | 5 |
powdered hot cocoa mix | 45609 | 10 | 4 | 4 | 5 |
Here, I will inspect the relevant columns of the dataframe in relation to the `calories (#)` or `cal_bins` column.
total fat (PDV) vs calories (#)

[Figure: `total fat (PDV)` plotted against `calories (#)`]

There is a clear association between `total fat (PDV)` and `calories (#)`. Interestingly, the association is very apparent until 3800 calories, then becomes a bit more variable up to around 12000, then becomes very random after that. This may be due to the decreasing number of data points at higher calories.

[Figure: bar chart of the 20 tags with the highest median `calories (#)`]

Each bar represents a tag (for example `meat` or `main dish`), and the length of each bar represents the median value of `calories (#)` for that tag. `pork-ribs`, `whole-chicken`, and `wings` were the top three.

`# cutting the post_clean df by 100s for the calories column`
hundred_width = dfcut(post_clean, 100)
hundred_width = hundred_width.loc[hundred_width['calories (#)'] <= 1500]
hundred_width.pivot_table(index = 'cal_bins', values = ['minutes', 'n_ingredients', 'avg_rating'], aggfunc = ['mean', 'median'])
cal_bins | (‘mean’, ‘avg_rating’) | (‘mean’, ‘minutes’) | (‘mean’, ‘n_ingredients’) | (‘median’, ‘avg_rating’) | (‘median’, ‘minutes’) | (‘median’, ‘n_ingredients’) |
---|---|---|---|---|---|---|
100 | 4.64557 | 142.603 | 7.07242 | 5 | 20 | 7 |
200 | 4.62707 | 71.1157 | 8.23044 | 5 | 30 | 8 |
300 | 4.62246 | 84.0302 | 9.09422 | 5 | 35 | 9 |
400 | 4.62012 | 91.412 | 9.6538 | 5 | 40 | 9 |
500 | 4.6191 | 195.309 | 10.0194 | 5 | 44 | 10 |
600 | 4.62436 | 100.17 | 10.3268 | 5 | 45 | 10 |
700 | 4.61904 | 91.6952 | 10.5283 | 5 | 45 | 10 |
800 | 4.62502 | 128.646 | 10.6066 | 5 | 45 | 10 |
900 | 4.64032 | 276.698 | 10.7848 | 5 | 50 | 10 |
1000 | 4.61836 | 125.211 | 10.7967 | 5 | 50 | 10 |
1100 | 4.58361 | 120.967 | 10.7433 | 5 | 45 | 10 |
1200 | 4.62894 | 142.561 | 10.3385 | 5 | 45 | 10 |
1300 | 4.58186 | 230.44 | 10.1769 | 5 | 45 | 10 |
1400 | 4.64782 | 281.164 | 10.6641 | 5 | 50 | 10 |
1500 | 4.6636 | 124.933 | 10.1953 | 5 | 50 | 9 |
- `avg_rating` has no observable pattern in relation to calories: the means are all around 4.6, while the medians are all 5.0. Ratings appear to be distributed similarly across the calorie bins.
- `minutes` seems random when looking at the mean, but the median shows a general positive correlation with calories.
- `n_ingredients` seems to have a positive correlation with calories for both means and medians, but levels off at around 600 calories.

Let’s quickly go over the missingness types. Definitions will be borrowed from the UCSD DSC80 class, lecture 12.
Type | Definition |
---|---|
Missing By Design (MD) | Missing values can be exactly determined in a column by looking at other columns. |
Not Missing At Random (NMAR) | Missingness of values depends on the values themselves. |
Missing At Random (MAR) | Missingness of values depends on other column(s) in the dataset. |
Missing Completely At Random (MCAR) | Missingness of values does not depend on the column itself or other columns. |
To start looking at missing values, I will identify which columns have missing values.
new_recipes.isna().sum()[new_recipes.isna().sum() != 0]
results in
name 1
description 70
avg_rating 2609
cal_bins 27
dtype: int64
`avg_rating` may be NMAR. During the data cleaning process, I replaced `avg_rating` values of 0 with `np.nan`, thus “creating” missing values in this column artificially. However, this step is reasonable, and was justified in the data cleaning step. We can reason that reviewers may be more likely to provide star ratings to recipes that they either enjoyed or hated. Therefore, the missingness of `avg_rating` may be dependent on the star rating itself.

If food.com were to change their reviewing process and add a required question like “How much did you like the recipe?”, where the options are thumbs up, thumbs sideways, and thumbs down, I think `avg_rating` may become MAR. If we assume my reasoning for `avg_rating` being NMAR is correct, then the “0 star rating” should be more associated with the “thumbs sideways” response than with the “thumbs up” or “thumbs down” response.
I will test whether the missingness of `avg_rating` depends on the `calories (#)` or `n_ingredients` columns by performing two permutation tests. The significance level will be 0.05.
Null Hypotheses: The distributions of `cal_bins` / `n_ingredients` are the same when `avg_rating` is missing and not missing.

Alternative Hypotheses: The distributions of `cal_bins` / `n_ingredients` are not the same when `avg_rating` is missing and not missing.

I have two permutation tests that I want to run: one for the `cal_bins` column, and another one for the `n_ingredients` column.
Does the missingness of `avg_rating` depend on the `calories (#)` column?

[Figure: The distributions of calories (`cal_bins`) by missingness of `avg_rating` (graph only shows calories >= 1800)]

- `calories (#)` was converted into a categorical variable using a bin width of 10 calories.
- The test was run by shuffling the `missing_rating` column and calculating the test statistic after each run.

[Figure: Result of running the permutation test 10000 times]

Conclusion: the missingness of the `avg_rating` column appears to be dependent on the `cal_bins` column (and therefore the `calories (#)` column).

Does the missingness of `avg_rating` depend on the `n_ingredients` column?

- The test was run by shuffling the `missing_rating` column and calculating the test statistic after each run.

[Figure: Result of running the permutation test 10000 times]

Conclusion: there is not enough evidence that the missingness of the `avg_rating` column is dependent on the `n_ingredients` column.

We found that there is strong evidence to suggest that `avg_rating` is MAR, dependent on `calories (#)`, but is likely not dependent on `n_ingredients`. This means that, on recipes that have higher calories, reviewers are more likely to write a review without leaving a star rating (thus, the “0 stars rating”). I cannot come up with a reasonable explanation for this correlation; it may be the case that `calories (#)` is a proxy for some other metric, and `avg_rating` is actually MAR dependent on that metric, and only appears MAR dependent on `calories (#)` because of the proxy relationship.
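
The permutation procedure used in this section can be sketched as follows. This is a minimal illustration on made-up toy data, not the project's code. It assumes total variation distance (TVD) as the test statistic, which the cleaning section mentions for the binned calories; the statistic used for `n_ingredients` is not stated. The helper names `tvd` and `permutation_pvalue` are hypothetical.

```python
import numpy as np
import pandas as pd

def tvd(df, cat_col, flag_col):
    """Total variation distance between the cat_col distributions of the two flag groups."""
    dist = pd.crosstab(df[cat_col], df[flag_col], normalize="columns")
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

def permutation_pvalue(df, cat_col, flag_col, n_reps=200, seed=0):
    """Shuffle the missingness flag and compare the simulated TVDs to the observed one."""
    rng = np.random.default_rng(seed)
    observed = tvd(df, cat_col, flag_col)
    shuffled = df[[cat_col, flag_col]].copy()
    stats = np.empty(n_reps)
    for i in range(n_reps):
        # Shuffling keeps the number of missing rows fixed but breaks any link to cat_col.
        shuffled[flag_col] = rng.permutation(df[flag_col].to_numpy())
        stats[i] = tvd(shuffled, cat_col, flag_col)
    return observed, (stats >= observed).mean()

# Toy data (made up): missingness lines up perfectly with the bin label.
toy = pd.DataFrame({
    "cal_bins": [100] * 50 + [200] * 50,
    "missing_rating": [True] * 50 + [False] * 50,
})
observed_tvd, p_value = permutation_pvalue(toy, "cal_bins", "missing_rating")
print(observed_tvd, p_value)
```

On the real data, `missing_rating` would be `post_clean['avg_rating'].isna()`, and the p-value would be compared against the 0.05 significance level.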
With the power of hypothesis testing, I will try to answer the original question: what kinds of recipes have the highest calories? Recall how I plotted the tags with the top 20 highest median calories? We will go back to the `tags` column to help us answer this question.

The tag associated with the highest median calories was `pork-ribs`. Therefore, I will construct a hypothesis test to determine if there is an association between `pork-ribs` and calories.
Null Hypothesis: `calories (#)` and the `pork-ribs` tag are not related - the high median `calories (#)` of recipes with the `pork-ribs` tag is due to random chance alone.

Alternative Hypothesis: `calories (#)` and the `pork-ribs` tag are related - the high median `calories (#)` of recipes with the `pork-ribs` tag is not due to random chance alone.

The significance level will be set at 0.05.
Test statistic: Median calories

Observed statistic (median `calories (#)` of recipes with the `pork-ribs` tag): 790.1

Method: Randomly draw 209 `calories (#)` values and compute their median, repeated 1,000,000 times. 209 is the number of recipes with the `pork-ribs` tag.

[Figure: Result of 1,000,000 Runs]

Conclusion: `calories (#)` and `pork-ribs` are likely to be related - it is unlikely that results this extreme (an empirical p-value of 0) would appear by pure chance. Since the observed statistic was all the way to the right of the empirical distribution of median calories (refer to the graph of the results), recipes with the tag `pork-ribs` do indeed seem to have higher calories, compared to the calorie distribution of the entire recipes dataset (but we can never say for sure!).

Recipes for pork ribs seem to have a tendency toward higher calorie values. If you are on a diet, make sure to avoid pork rib dishes! :smile:
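
The sampling procedure behind this test can be sketched as below. It runs on synthetic numbers, not the recipes data. It assumes sampling without replacement, which the write-up does not specify, and uses far fewer repetitions than the 1,000,000 reported above. The function name `median_sampling_test` is hypothetical.

```python
import numpy as np

def median_sampling_test(values, observed_median, sample_size, n_reps=500, seed=0):
    """Simulate medians of random samples and return them with an empirical p-value."""
    rng = np.random.default_rng(seed)
    medians = np.empty(n_reps)
    for i in range(n_reps):
        # Assumption: each simulated group is drawn without replacement.
        sample = rng.choice(values, size=sample_size, replace=False)
        medians[i] = np.median(sample)
    # One-sided p-value: share of simulated medians at least as large as the observed one.
    return medians, (medians >= observed_median).mean()

# Synthetic stand-in for the calories column of the full dataset.
all_calories = np.arange(1, 1001, dtype=float)
simulated, p_value = median_sampling_test(all_calories, observed_median=900.0, sample_size=50)
print(p_value)
```

With the real data, `values` would be the full `calories (#)` column, `observed_median` would be 790.1, and `sample_size` would be 209.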
- `calories (#)` and `n_ingredients` are right skewed in this dataset.
- The tag associated with the highest median calories was `pork-ribs`.
- The missingness of `avg_rating` seems to be dependent on `calories (#)`, although it is hard to pinpoint why it would be dependent on calories. As mentioned in the missingness section, it may be the case that `calories (#)` is a proxy for some other factor which `avg_rating` is truly dependent on; it just seems like `avg_rating` is dependent on `calories (#)`.

I wish I had a column for serving size, so I could normalize the various metrics of recipes, such as `calories (#)` or the other nutritional values. This would have been very useful, as I would have been able to compare the calorie counts across different recipes more effectively, and perhaps identify a better indicator for high calorie recipes.
Thank you!