Data science is a combination of mathematics, business knowledge, tools, algorithms, and machine learning techniques that aid in the discovery of hidden insights or patterns in raw data that can be used to make critical business decisions. Data science deals with both structured and unstructured data, and its algorithms also draw on predictive analytics. Thus, data science is concerned with both the present and the future: it identifies trends in historical data that can inform current decisions, and it identifies patterns that can be modeled to forecast how things might look in the future. This article, Top 100 Data Science Interview Questions and Answers 2021, covers basic data science interview questions and their answers.

The art of discovering insights and trends in data has been around for a very long time. The ancient Egyptians used census data to make tax collection more efficient, and they forecast the annual flooding of the Nile. Since then, people who work with data have carved out a distinct field for their work: data science. Let us take a look at common data science interview questions below.

**Top 100 Data Science Interview Questions and Answers 2021**

- What is Data Science?

In a nutshell, data science is an interdisciplinary field of study that uses data for various research and reporting purposes in order to derive insights and meaning from that data. Data science necessitates a diverse set of skills, including statistics, business acumen, computer science, and others.

- What is Selection Bias?

Selection bias is a type of error that occurs when the researcher decides who will be studied. It is usually associated with research in which the participants are not chosen at random, and it is also known as the selection effect. It distorts statistical analysis because of the way the sample was collected. If selection bias is not taken into account, some of the study’s conclusions may be incorrect.

- What’s the distinction between point estimates and confidence intervals?

Point estimation gives us a single specific value as an estimate of a population parameter. Point estimators for population parameters are commonly derived using the Method of Moments or Maximum Likelihood estimation.

A confidence interval gives us a range of values that is likely to contain the population parameter. The confidence interval is generally preferred because it conveys how much uncertainty surrounds the estimate. The Confidence Level, or Confidence Coefficient, is the proportion of such intervals that would contain the population parameter if the sampling were repeated many times. It is written as 1 - alpha, where alpha is the level of significance.
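To make the contrast concrete, here is a minimal sketch that computes a point estimate of a mean together with a confidence interval around it. The function name `mean_confidence_interval` and the height values are made up for illustration, and the interval uses a normal (z) approximation, which is an assumption that fits large samples better than small ones.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, lower, upper) for the population mean.

    Assumes a normal (z) approximation; a small sample would usually
    call for a t-interval instead.
    """
    n = len(sample)
    point_estimate = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    alpha = 1 - confidence                      # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided critical value
    margin = z * standard_error
    return point_estimate, point_estimate - margin, point_estimate + margin


# Hypothetical sample of heights in centimetres.
heights_cm = [162, 170, 168, 175, 159, 181, 166, 172, 169, 174]
estimate, lower, upper = mean_confidence_interval(heights_cm)
print(f"point estimate: {estimate:.1f}")
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```

Raising `confidence` (for example to 0.99) lowers alpha and widens the interval, which is the trade-off between certainty and precision.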

- What Is the Law of Large Numbers?

It is a theorem that describes the outcome of repeating the same experiment many times, and it serves as the foundation for frequency-style reasoning. It states that as the number of trials grows, the sample mean converges to the expected value. The sample variance and standard deviation likewise converge to the population quantities they are estimating.
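A quick simulation can illustrate the idea. The sketch below flips a simulated fair coin and tracks the running proportion of heads; the function name and the fixed seed are choices made for this example only.

```python
import random


def running_mean_of_coin_flips(num_flips, seed=0):
    """Return the proportion of heads after each flip of a fair coin."""
    rng = random.Random(seed)   # seeded so the run is repeatable
    heads = 0
    means = []
    for i in range(1, num_flips + 1):
        heads += rng.random() < 0.5   # True counts as 1, False as 0
        means.append(heads / i)
    return means


means = running_mean_of_coin_flips(100_000)
for n in (10, 1_000, 100_000):
    print(f"after {n:>7} flips: proportion of heads = {means[n - 1]:.4f}")
```

The early proportions can be far from 0.5, while the later ones settle close to it, which is the convergence the theorem describes.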

- What is Survivorship Bias?

It is the logical error of focusing on the things that survived some process while overlooking those that did not, because the failures are less visible. This can lead to incorrect conclusions in a variety of ways.

- What is TF/IDF vectorization?

TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic that indicates how important a word is to a document in a collection or corpus.

The TF–IDF value rises proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently than others in general.
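The weighting can be sketched in a few lines of plain Python. Several TF–IDF variants exist; this example assumes tf = count / document length and idf = log(N / document frequency), splits on whitespace only, and uses a made-up three-sentence corpus.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(corpus)
for term, weight in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term:>4}: {weight:.3f}")
```

In the first document, "the" appears twice but occurs in every document, so its weight is zero, while "mat" appears once in the whole corpus and receives the highest weight.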

- What is Systematic Sampling?

Systematic sampling is a statistical technique that involves selecting elements from an ordered sampling frame. In systematic sampling, the list is traversed in a circular fashion, so when you reach the end of the list, you continue again from the beginning. The most common form of systematic sampling is the equal-probability method, in which every k-th element is selected after a random start.
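As a rough sketch of the equal-probability method, the function below picks a random starting position and then takes every k-th element, wrapping around the end of the list. The name `systematic_sample` and the customer IDs are hypothetical.

```python
import random


def systematic_sample(frame, sample_size, seed=None):
    """Pick every k-th element of an ordered frame after a random start.

    The frame is treated as circular: positions wrap around past the end.
    """
    n = len(frame)
    if not 0 < sample_size <= n:
        raise ValueError("sample_size must be between 1 and len(frame)")
    step = n // sample_size                     # sampling interval k
    start = random.Random(seed).randrange(n)    # random starting position
    return [frame[(start + i * step) % n] for i in range(sample_size)]


customers = list(range(1, 101))   # hypothetical ordered list of customer IDs
sample = systematic_sample(customers, 10, seed=42)
print(sample)
```

Because the interval times the sample size never exceeds the length of the frame, the selected positions never repeat.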

- Explain Cross Validation.

Cross-validation is a model validation technique used to assess how the results of a statistical analysis will generalize to a different data set. It is typically used when the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set for testing the model during the training phase (i.e. a validation data set) in order to limit overfitting and gain insight into how the model will generalize to an independent data set.
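One common form is k-fold cross-validation, sketched below in plain Python. To keep the example self-contained, the "model" is just a baseline that predicts the mean of the training targets; the helper names and the target values are invented for illustration.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal parts
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validated_mse(targets, k=5):
    """Average validation error of a baseline that predicts the training mean."""
    fold_errors = []
    for train, validation in k_fold_indices(len(targets), k):
        prediction = statistics.mean(targets[i] for i in train)
        fold_errors.append(
            statistics.mean((targets[i] - prediction) ** 2 for i in validation)
        )
    return statistics.mean(fold_errors)


targets = [3.1, 2.9, 3.4, 3.0, 2.8, 3.3, 3.2, 2.7, 3.5, 3.1]  # made-up values
print(f"cross-validated MSE: {cross_validated_mse(targets):.3f}")
```

Every observation is used for validation exactly once and never appears in the training set of its own fold, which is what makes the averaged error a fairer estimate of performance on unseen data.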

- What is pruning in Decision Tree?

Pruning is a technique used in machine learning and search algorithms to shrink decision trees by removing branches with little power to classify instances. So, when we remove sub-nodes from a decision node, we call this process pruning, which is the inverse of splitting.
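There are several pruning strategies. The sketch below shows one of them, reduced-error pruning, on a tiny hand-built tree: a subtree is collapsed into a leaf whenever that does not lower accuracy on a validation set. The dictionary layout of the tree (`feature`, `threshold`, `majority`, `label`) and the data are assumptions made for this example, not a standard format.

```python
def predict(node, row):
    """Walk the tree until a leaf (a node that carries a 'label') is reached."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def prune(node, tree, rows, labels):
    """Reduced-error pruning: collapse a subtree into a leaf when doing so
    does not lower accuracy on the validation rows."""
    if "label" in node:
        return
    prune(node["left"], tree, rows, labels)     # prune bottom-up
    prune(node["right"], tree, rows, labels)
    before = accuracy(tree, rows, labels)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]           # try replacing the split with a leaf
    if accuracy(tree, rows, labels) < before:
        node.clear()
        node.update(saved)                      # pruning hurt, so restore the split


# Hypothetical tree whose right-hand split overfits the training data.
tree = {
    "feature": 0, "threshold": 5, "majority": "A",
    "left": {"label": "A"},
    "right": {
        "feature": 1, "threshold": 2, "majority": "B",
        "left": {"label": "B"},
        "right": {"label": "A"},
    },
}
val_rows = [(1, 0), (3, 4), (7, 1), (8, 3), (9, 5)]
val_labels = ["A", "A", "B", "B", "B"]

print("accuracy before pruning:", accuracy(tree, val_rows, val_labels))
prune(tree, tree, val_rows, val_labels)
print("accuracy after pruning: ", accuracy(tree, val_rows, val_labels))
```

Here the inner split is removed because it only hurts validation accuracy, while the root split is kept because collapsing it would make the predictions worse.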

- What is the distinction between Regression and Classification Machine Learning techniques?

Regression and classification are both supervised machine learning techniques. In supervised learning, we must train the model on a labeled data set: during training we explicitly provide the correct labels, and the algorithm attempts to learn the mapping from input to output. If our labels are discrete values like A, B, etc., we have a classification problem; if our labels are continuous values like 1.23, 1.333, etc., we have a regression problem.
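The difference shows up clearly when the same algorithm is applied to both kinds of label. The sketch below uses a simple k-nearest-neighbours approach, chosen here only for illustration: for classification it takes a majority vote over discrete labels, and for regression it averages continuous values. The data and function names are made up.

```python
from collections import Counter


def nearest_neighbors(train_x, query, k):
    """Indices of the k training points closest to the query (1-D inputs)."""
    return sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]


def knn_classify(train_x, train_labels, query, k=3):
    """Classification: majority vote over discrete labels."""
    votes = Counter(train_labels[i] for i in nearest_neighbors(train_x, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_values, query, k=3):
    """Regression: average of continuous target values."""
    neighbors = nearest_neighbors(train_x, query, k)
    return sum(train_values[i] for i in neighbors) / k


# Hypothetical labeled data: the same inputs with two kinds of label.
hours_studied = [1, 2, 3, 6, 7, 8]
grade = ["B", "B", "B", "A", "A", "A"]          # discrete labels
score = [1.23, 1.33, 1.50, 3.10, 3.40, 3.60]    # continuous labels

print("predicted grade:", knn_classify(hours_studied, grade, 6.5))
print("predicted score:", round(knn_regress(hours_studied, score, 6.5), 2))
```

The inputs and the neighbour search are identical in both cases; only the type of label, and therefore the way neighbours are combined into a prediction, changes.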

