Data Cleaning and Exploratory Data Analysis

Imputation

No imputation was performed.

We found nan value in 2 categories:

1. name, description, date, review: Their absence does not prevent us from obtaining nutritional health information and complexity from other columns, so we decided to keep the data records with these Nan columns.

2. reviewer_id, rating, avg_rating: reviewer_id and avg_rating will only be Nan when a recipe does not match any reviewer’s reviews, and the Nan of rating may come from the absence of reviews or the existence of reviews but the value of the rating itself is not successfully entered. Since exploring how recipes are rated are what we want to do next, we drop them at this stage.

Data Cleaning

1 Data Cleaning

Step	Why
1. Drop Nan	rating-related nan rows are dropped (reason discussed above).
2. Filter to recipes with > 3 ratings	‘avg_rating’ is used as as a summary of each recipe’s overall rating. Averages based on 1-2 votes are too noisy to trust.
3. Extract Interested Columns	‘avg_rating’ represents recipes’ overally ratings, ‘minutes’, ‘n_steps’, and ‘n_ingredients’ represent complexity,’nutrition’ and ‘tags’ bring nutritional health information. They contain all the information we are interested to investigate.
4. Extract nutrition components from `nutrition`	‘nutrition’ column is in the form of a string to mimic a list. We toke the nutition components out and converted to numeric data for health analysis (‘nutrition’ is dropped after extraction).
5. Remove implausible nutrition rows	Values far past 100 %DV are likely scraped/entry errors (e.g., “30 000 % sugar”), so only values ≤ 150 are kept.
6. Add `is_healthy` flag from `tags`	We marked `tags` containing healthy, healthy-2, high-in-something-diabetic-friendly true for a new column `is_healthy`, since tags are author-entered meta-data whcih summaize the overall healthiness better than single nutrients (‘tags’ is dropped after extraction).

(Note: Although there are tags like low-saturated-fat, they only reflect one aspect of nutrition. Since we cannot guarantee the overall healthiness of the recipe based on such tags alone, we only selected tags that more clearly indicate a healthy recipe as a whole.)

After the data cleaning, there are 13878 rows left.

Below is the head of the cleaned modeling table (one row per rated recipe):

id	avg_rating	minutes	n_steps	n_ingredients	calories	total_fat	sugar	sodium	protein	saturated_fat	carbohydrates	is_healthy
275030	5.00	45	11	9	577.7	53.0	149.0	19.0	14.0	67.0	21.0	False
275061	4.78	65	16	11	402.5	37.0	9.0	40.0	42.0	56.0	8.0	False
275071	4.86	90	7	9	166.1	10.0	6.0	0.0	7.0	4.0	7.0	True
275072	4.25	35	8	11	227.5	12.0	6.0	11.0	67.0	12.0	1.0	False
275094	4.90	12	3	5	99.5	7.0	23.0	14.0	11.0	3.0	3.0	True

Univariate Analysis

Interpretation** The histogram above shows that most recipes cluster between 5–15 steps, with a peak around 8 steps. It shows some variation which might help us learn recipes’ ratings. Since it basically follows a normal distribution, we don’t need to adjust its distribution anymore if it’s used in prediciton model.

Distribution of Average Ratings

Interpretation.
The histogram of avg_rating is highly left-skewed: the vast majority of recipes cluster at the top end of the 1–5 scale. This likely reflects a positivity bias—home cooks are more inclined to leave a review when they’ve had a good experience.

Modeling implication: Because our target (high_rating) is defined on this skewed distribution (only ≈20 % of recipes exceed 4.5 stars), we might need to address class imbalance in our prediction pipeline.

Bivariate Analysis

Calories vs Average Rating

Interpretation The ratings distribution for low-calorie and medium-calorie recipes isrelatively balanced, as they appear across the entire rating spectrum. However, we observed a modest uptick in average calories at the highest rating bins. This might suggests that there is a segment of users who reward richer, higher-calorie dishes with top scores—justifying calories as a useful feature in our prediction model.

Interesting Aggregates

Mean & Median of Key Metrics by Rating Bin

Mean

rating_bin	minutes	n_steps	n_ingredients	calories	sugar	sodium	saturated_fat
1	19.75	7.25	6.00	237.12	65.00	8.00	17.00
2	79.18	8.65	8.47	255.66	37.47	18.53	23.15
3	82.46	9.99	9.37	332.47	31.78	22.05	29.28
4	72.23	9.34	8.95	313.39	29.87	21.27	27.38
5	60.64	9.56	8.81	303.72	32.86	20.05	27.78

Median

rating_bin	minutes	n_steps	n_ingredients	calories	sugar	sodium	saturated_fat
1	20.0	7.5	7.0	249.30	59.0	8.0	21.5
2	34.0	7.5	8.0	218.40	16.0	12.5	16.0
3	40.0	9.0	9.0	278.40	18.0	12.0	17.0
4	35.0	8.0	9.0	267.40	18.0	14.0	18.0
5	30.0	8.0	8.0	253.65	19.0	13.0	18.0

Significance:

The mean and median distribution indicate there are many large outliers in minutes, calories, sugar, sodium, and saturated_fat, which suggests that we should probably apply some feature engineering on them if we want to use for prediction.
Pattern: it seems low-sugar recipes tend to have higher ratings. The complexity, calories, sodium seem to follow a bell-shaped distribution with ratings. Perhaps people acknowledge the taste of food made from more complex/unhealthy recipes, but slightly lower their ratings due to the complexity/unhealthiness, which results in such a distribution.

Here are Zhicong and Tianxin (Jessica).

Email: zhicong@umich.edu & ltianxin@umich.edu