Term
Supervised Learning |
|
Definition
Uses known and labeled data as input. Supervised learning has a feedback mechanism: the model's predictions are compared against known target values. The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines. |
|
|
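As a minimal sketch of supervised learning (scikit-learn and the iris data set are assumed here for illustration), the labeled targets y are what provide the feedback mechanism:

    # Supervised learning sketch: known labels provide feedback to the model.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)                 # features plus known labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)                         # learns from labeled examples
    print(clf.score(X_test, y_test))                  # accuracy against held-out labels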
Term
Unsupervised Learning |
|
Definition
Uses unlabeled data as input. Unsupervised learning has no feedback mechanism. The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm. |
|
|
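For contrast, a minimal unsupervised-learning sketch (scikit-learn assumed, toy points invented for illustration); k-means receives no labels and gets no feedback:

    # Unsupervised learning sketch: no labels, the algorithm finds structure itself.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],             # unlabeled observations
                  [10, 2], [10, 4], [10, 0]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)                             # cluster assignments discovered from the data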
Term
How can you avoid overfitting your model? |
|
Definition
Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data. Use cross-validation techniques, such as k-fold cross-validation. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting. |
|
|
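A hedged sketch of two of the techniques above, assuming scikit-learn: 5-fold cross-validation to estimate out-of-sample performance, wrapped around a LASSO model whose L1 penalty shrinks weak coefficients:

    # Overfitting countermeasures sketch: k-fold cross-validation plus LASSO.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    model = Lasso(alpha=0.1)                          # L1 penalty can zero out noisy coefficients
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validation
    print(scores.mean())                              # average out-of-fold R^2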
Term
What are the feature selection methods used to select the right variables? |
|
Definition
Filter methods and wrapper methods. |
|
|
Term
Filter method for variable selection |
|
Definition
Linear discriminant analysis (LDA), ANOVA, and Chi-square. The best analogy for selecting features is "bad data in, bad answer out": when we're limiting or selecting the features, it's all about cleaning up the data coming in. |
|
|
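A minimal filter-method sketch (scikit-learn assumed): features are scored with a chi-square test independently of any downstream model, and only the top k are kept:

    # Filter method sketch: rank features by chi-square score, keep the best two.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)                 # chi2 needs non-negative features
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
    print(X_new.shape)                                # (150, 2): two best features kept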
Term
Wrapper methods for variable selection |
|
Definition
Forward Selection: We test one feature at a time and keep adding features until we get a good fit. Backward Selection: We test all the features and start removing them to see what works better. Recursive Feature Elimination: Recursively looks through all the different features and how they pair together. Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method. |
|
|
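A minimal wrapper-method sketch (scikit-learn assumed), using recursive feature elimination: the model is refit repeatedly, dropping the weakest feature each round, which is why wrapper methods are compute-intensive:

    # Wrapper method sketch: recursive feature elimination around a classifier.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
    selector.fit(X, y)                                # refits the model, pruning one feature at a time
    print(selector.support_)                          # boolean mask of the features kept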
Term
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them? |
|
Definition
If the data set is large, we can simply remove the rows with missing data values. For smaller data sets, we can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python; for example, df.fillna(df.mean()) fills each column's missing values with that column's mean. |
|
|
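Both strategies in a small pandas sketch (the toy DataFrame is invented for illustration):

    # Handling missing values with pandas.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": [4.0, 5.0, np.nan]})      # toy data with missing values

    dropped = df.dropna()                             # large data set: drop rows with missing values
    filled = df.fillna(df.mean())                     # small data set: impute each column's mean
    print(filled)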
Term
What are dimensionality reduction and its benefits? |
|
Definition
Dimensionality reduction refers to the process of converting a data set with a vast number of dimensions into data with fewer dimensions (fields) while conveying similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing a value in two different units (meters and inches). |
|
|
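One common concrete technique is principal component analysis (PCA); a minimal scikit-learn sketch, offered as one illustration of dimensionality reduction rather than the only approach:

    # Dimensionality reduction sketch: project 4 features onto 2 principal components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)                 # 150 rows x 4 features
    X_reduced = PCA(n_components=2).fit_transform(X)
    print(X_reduced.shape)                            # (150, 2): fewer dimensions, similar information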
Term
Difference between Point Estimates and Confidence Interval |
|
Definition
Confidence Interval: A confidence interval gives a range of values that is likely to contain the population parameter, and it tells us how likely it is that the interval contains the parameter. The confidence coefficient (or confidence level), denoted by 1 - alpha, gives this probability; alpha is the level of significance.
Point Estimates: A point estimate is a single value that serves as an estimate of the population parameter. Popular methods used to derive point estimators of population parameters are the maximum likelihood estimator and the method of moments. |
|
|
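A worked sketch of the distinction (sample values invented for illustration): the sample mean is the point estimate, and SciPy's t distribution gives a 95% confidence interval around it:

    # Point estimate vs. confidence interval for a population mean.
    import numpy as np
    from scipy import stats

    sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])
    point_estimate = sample.mean()                    # single best guess for the population mean

    # 95% interval: confidence level 1 - alpha with alpha = 0.05
    ci = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=point_estimate, scale=stats.sem(sample))
    print(point_estimate, ci)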
Term
Standardization |
|
Definition
The technique of rescaling data so that it has a mean of 0 and a standard deviation of 1. If the underlying data are normally distributed, standardized values follow the standard normal distribution. Standardization formula - X’ = (X - μ) / σ
Here,
μ - feature’s mean,
σ - feature’s standard deviation |
|
|
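A minimal NumPy sketch of the standardization formula above (toy values for illustration):

    # Standardization (z-score): X' = (X - mean) / std
    import numpy as np

    X = np.array([10.0, 20.0, 30.0, 40.0])
    X_std = (X - X.mean()) / X.std()
    print(X_std.mean(), X_std.std())                  # approximately 0.0 and 1.0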
Term
Normalization |
|
Definition
The technique of converting all data values to lie between 0 and 1 is known as Normalization. This is also known as min-max scaling. Normalization formula - X’ = (X - Xmin) / (Xmax - Xmin)
Here,
Xmin - feature’s minimum value,
Xmax - feature’s maximum value |
|
|
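And the matching min-max normalization sketch in NumPy (same toy values):

    # Normalization (min-max scaling): X' = (X - Xmin) / (Xmax - Xmin)
    import numpy as np

    X = np.array([10.0, 20.0, 30.0, 40.0])
    X_norm = (X - X.min()) / (X.max() - X.min())
    print(X_norm)                                     # [0.0, 0.333..., 0.666..., 1.0]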
Term
Why is R used in Data Visualization? |
|
Definition
R is widely used in data visualization for the following reasons:
- We can create almost any type of graph using R.
- R has multiple libraries, such as lattice, ggplot2, and leaflet, and many inbuilt functions as well.
- It is easier to customize graphics in R compared to Python.
- R is also used in feature engineering and exploratory data analysis. |
|
|