Term
Supervised Learning |
|
Definition
Uses known and labeled data as input. Supervised learning has a feedback mechanism: the model's predictions are compared against known target values. The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines. |
|
|
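As a minimal sketch of supervised learning (scikit-learn and the iris data set are assumed here for illustration), the labeled targets y are what provide the feedback mechanism:

    # Supervised learning sketch: known labels provide feedback to the model.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)                 # features plus known labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)                         # learns from labeled examples
    print(clf.score(X_test, y_test))                  # accuracy against held-out labels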
Term
Unsupervised Learning |
|
Definition
Uses unlabeled data as input. Unsupervised learning has no feedback mechanism. The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm. |
|
|
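For contrast, a minimal unsupervised-learning sketch (scikit-learn assumed, toy points invented for illustration); k-means receives no labels and gets no feedback:

    # Unsupervised learning sketch: no labels, the algorithm finds structure itself.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],             # unlabeled observations
                  [10, 2], [10, 4], [10, 0]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)                             # cluster assignments discovered from the data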
Term
How can you avoid overfitting your model? |
|
Definition
Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data. Use cross-validation techniques, such as k-fold cross-validation. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting. |
|
|
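A hedged sketch of two of the techniques above, assuming scikit-learn: 5-fold cross-validation to estimate out-of-sample performance, wrapped around a LASSO model whose L1 penalty shrinks weak coefficients:

    # Overfitting countermeasures sketch: k-fold cross-validation plus LASSO.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    model = Lasso(alpha=0.1)                          # L1 penalty can zero out noisy coefficients
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validation
    print(scores.mean())                              # average out-of-fold R^2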
Term
What are the feature selection methods used to select the right variables? |
|
Definition
Filter methods and wrapper methods. |
|
|
Term
Filter method for variable selection |
|
Definition
Linear discriminant analysis (LDA), ANOVA, and Chi-square. The best analogy for selecting features is "bad data in, bad answer out": when we're limiting or selecting the features, it's all about cleaning up the data coming in. |
|
|
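A minimal filter-method sketch (scikit-learn assumed): features are scored with a chi-square test independently of any downstream model, and only the top k are kept:

    # Filter method sketch: rank features by chi-square score, keep the best two.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)                 # chi2 needs non-negative features
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
    print(X_new.shape)                                # (150, 2): two best features kept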
Term
Wrapper methods for variable selection |
|
Definition
Forward Selection: We test one feature at a time and keep adding features until we get a good fit. Backward Selection: We test all the features and start removing them to see what works better. Recursive Feature Elimination: Recursively looks through all the different features and how they pair together. Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method. |
|
|
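A minimal wrapper-method sketch (scikit-learn assumed), using recursive feature elimination: the model is refit repeatedly, dropping the weakest feature each round, which is why wrapper methods are compute-intensive:

    # Wrapper method sketch: recursive feature elimination around a classifier.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
    selector.fit(X, y)                                # refits the model, pruning one feature at a time
    print(selector.support_)                          # boolean mask of the features kept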
Term
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them? |
|
Definition
If the data set is large, we can simply remove the rows with missing data values. For smaller data sets, we can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python; for example, df.fillna(df.mean()) fills each column's missing values with that column's mean. |
|
|
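Both strategies in a small pandas sketch (the toy DataFrame is invented for illustration):

    # Handling missing values with pandas.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": [4.0, 5.0, np.nan]})      # toy data with missing values

    dropped = df.dropna()                             # large data set: drop rows with missing values
    filled = df.fillna(df.mean())                     # small data set: impute each column's mean
    print(filled)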
Term
What are dimensionality reduction and its benefits? |
|
Definition
Dimensionality reduction refers to the process of converting a data set with a vast number of dimensions into data with fewer dimensions (fields) while conveying similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing a value in two different units (meters and inches). |
|
|
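One common concrete technique is principal component analysis (PCA); a minimal scikit-learn sketch, offered as one illustration of dimensionality reduction rather than the only approach:

    # Dimensionality reduction sketch: project 4 features onto 2 principal components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)                 # 150 rows x 4 features
    X_reduced = PCA(n_components=2).fit_transform(X)
    print(X_reduced.shape)                            # (150, 2): fewer dimensions, similar information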
Term
Difference between Point Estimates and Confidence Interval |
|
Definition
Confidence Interval: A confidence interval gives a range of values that is likely to contain the population parameter, and it tells us how likely it is that the interval contains the parameter. The confidence coefficient (or confidence level), denoted by 1 - alpha, gives this probability; alpha is the level of significance.
Point Estimates: A point estimate is a single value that serves as an estimate of the population parameter. Popular methods used to derive point estimators of population parameters are the maximum likelihood estimator and the method of moments. |
|
|
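A worked sketch of the distinction (sample values invented for illustration): the sample mean is the point estimate, and SciPy's t distribution gives a 95% confidence interval around it:

    # Point estimate vs. confidence interval for a population mean.
    import numpy as np
    from scipy import stats

    sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])
    point_estimate = sample.mean()                    # single best guess for the population mean

    # 95% interval: confidence level 1 - alpha with alpha = 0.05
    ci = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=point_estimate, scale=stats.sem(sample))
    print(point_estimate, ci)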
Term
Standardization |
|
Definition
The technique of rescaling data so that it has a mean of 0 and a standard deviation of 1. If the underlying data are normally distributed, standardized values follow the standard normal distribution. Standardization formula - X’ = (X - μ) / σ
Here,
μ - feature’s mean,
σ - feature’s standard deviation |
|
|
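A minimal NumPy sketch of the standardization formula above (toy values for illustration):

    # Standardization (z-score): X' = (X - mean) / std
    import numpy as np

    X = np.array([10.0, 20.0, 30.0, 40.0])
    X_std = (X - X.mean()) / X.std()
    print(X_std.mean(), X_std.std())                  # approximately 0.0 and 1.0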
Term
Normalization |
|
Definition
The technique of converting all data values to lie between 0 and 1 is known as Normalization. This is also known as min-max scaling. Normalization formula - X’ = (X - Xmin) / (Xmax - Xmin)
Here,
Xmin - feature’s minimum value,
Xmax - feature’s maximum value |
|
|
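And the matching min-max normalization sketch in NumPy (same toy values):

    # Normalization (min-max scaling): X' = (X - Xmin) / (Xmax - Xmin)
    import numpy as np

    X = np.array([10.0, 20.0, 30.0, 40.0])
    X_norm = (X - X.min()) / (X.max() - X.min())
    print(X_norm)                                     # [0.0, 0.333..., 0.666..., 1.0]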
Term
Why is R used in Data Visualization? |
|
Definition
R is widely used in data visualization for the following reasons:
- We can create almost any type of graph using R.
- R has multiple libraries, such as lattice, ggplot2, and leaflet, and many inbuilt functions as well.
- It is easier to customize graphics in R compared to Python.
- R is also used in feature engineering and exploratory data analysis. |
|
|