Suppose we randomly sample 100 people. What is the probability of fewer than 75 or more than 85 cilantro lovers?
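One way to approach this question is to treat the number of cilantro lovers as a binomial random variable and subtract the probability of landing inside the range from 1. The sketch below uses only the standard library. The success probability `p_cilantro = 0.8` is an assumption chosen for illustration, not a value given in the text, and the helper names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_outside(n: int, p: float, low: int, high: int) -> float:
    """P(X < low or X > high) for X ~ Binomial(n, p)."""
    # Sum the probability of every count inside [low, high], then take the complement.
    inside = sum(binom_pmf(k, n, p) for k in range(low, high + 1))
    return 1.0 - inside


# ASSUMPTION: 0.8 is a placeholder proportion of cilantro lovers.
p_cilantro = 0.8
print(prob_outside(n=100, p=p_cilantro, low=75, high=85))
```

Swapping in the true proportion for `p_cilantro` gives the answer for the actual population.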
This is also my excuse to review some probability theory and notation.
Discussion questions:
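The Pearson correlation coefficient is a measure of the linear correlation between two variables $X$ and $Y$:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\operatorname{cov}(X, Y)$ is the covariance of $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.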
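As a minimal sketch, the coefficient can be computed as the covariance of the two variables divided by the product of their standard deviations. The function name `pearson_r` and the sample data are hypothetical. In practice `numpy.corrcoef` or `pandas.DataFrame.corr` would be the usual choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence has zero variance and raises ZeroDivisionError.
    return cov / sqrt(var_x * var_y)


# Hypothetical data: a perfectly linear and a perfectly inverse relationship.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```

Values near 1 or -1 indicate a strong linear relationship; a value near 0 only rules out a linear one, not other kinds of dependence.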
General goals:
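The book lists three options for handling the NaN values:
housing.dropna(subset=["total_bedrooms"], inplace=True)  # option 1: drop the rows with a missing value
housing.drop("total_bedrooms", axis=1)  # option 2: drop the whole attribute (returns a new DataFrame)
median = housing["total_bedrooms"].median()  # option 3: fill missing values with the median
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)  # assign back; in-place fillna on a column is deprecated in recent pandas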
Discussion questions:
Most of the math in ML algorithms is based on numbers, so we need to convert text and categorical attributes to numbers. This is called encoding.
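To make the idea concrete, here is a dependency-free sketch of two common encodings: ordinal (one integer per category) and one-hot (one binary column per category). The function names and the sample category values are hypothetical. In a real pipeline, scikit-learn's `OrdinalEncoder` and `OneHotEncoder` would typically do this work.

```python
def ordinal_encode(values):
    """Map each category to an integer index (categories sorted alphabetically)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Map each category to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories


# Hypothetical categorical attribute.
sample = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]
print(ordinal_encode(sample))
print(one_hot_encode(sample))
```

Ordinal encoding implies an order between categories, which can mislead a model when no such order exists; one-hot encoding avoids that at the cost of more columns.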
Discussion questions:
Many ML algorithms don't like features with vastly different scales. Common scaling methods are min-max scaling and standardization.
Important: scaling is computed on the training set and applied to the validation and test sets - they are not scaled independently!
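The sketch below illustrates both methods on a single feature, and in particular the point about fitting on the training set only: the scaling parameters are computed once from the training data and then reused unchanged on the test data. The function names and the numbers are hypothetical. scikit-learn's `MinMaxScaler` and `StandardScaler` follow the same fit-then-transform pattern.

```python
from statistics import mean, pstdev


def fit_min_max(train):
    """Learn the min and max from the training data only."""
    return min(train), max(train)


def min_max_scale(values, lo, hi):
    """Rescale so the training range maps to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def fit_standardizer(train):
    """Learn the mean and (population) standard deviation from the training data only."""
    return mean(train), pstdev(train)


def standardize(values, mu, sigma):
    """Subtract the training mean and divide by the training standard deviation."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
train_set = [1.0, 2.0, 3.0, 4.0, 5.0]
test_set = [0.0, 6.0]

lo, hi = fit_min_max(train_set)
mu, sigma = fit_standardizer(train_set)

# The test set reuses the parameters learned from the training set.
print(min_max_scale(test_set, lo, hi))
print(standardize(test_set, mu, sigma))
```

Test values can fall outside the [0, 1] range after min-max scaling; that is expected, because the range was defined by the training data.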
Discussion questions:
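The probability density of a general Gaussian distribution is given by:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.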
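For a quick numerical check, the Gaussian density with mean `mu` and standard deviation `sigma` can be evaluated directly. This is a minimal sketch; the function name `gaussian_pdf` is hypothetical, and `scipy.stats.norm.pdf` offers the same calculation in practice.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma  # distance from the mean in units of sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


# The density peaks at the mean and is symmetric around it.
print(gaussian_pdf(0.0))
print(gaussian_pdf(1.0), gaussian_pdf(-1.0))
```

Summing the density over a fine grid and multiplying by the grid spacing should give a value close to 1, which is a handy sanity check on the normalization constant.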