Introduction to Statistics for Data Analysis
Goals of Data Analysis
There are three main goals of data analysis:
- Summarize the data
- Describe the data
- Predict the data
Summarize the Data
An example of summarizing the data is to calculate the mean of a variable.
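As a minimal sketch of this kind of summary, the snippet below computes the mean with Python's standard `statistics` module. The variable `heights_cm` and its values are made up for illustration.

```python
from statistics import mean

# Hypothetical sample: heights in centimetres (made-up values).
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

# The mean condenses the whole sample into a single representative number.
average_height = mean(heights_cm)
print(average_height)
```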
Describe the Data
Describing the data means understanding its characteristics and the relationships between variables.
Questions like “Was this effective?” or “Is there a relationship between X and Y?” fit here.
In data analysis, there are two main types of relationships: causality and correlation.
Causality
- Relationship of cause and effect: if-then
- If you know the cause, you can not only predict the effect, but also control the outcome by intervening on the cause.
- e.g. If you take this treatment, you will get better.
Correlation
- Relationship of tendency: when one variable is in a certain state, the other tends to be in a certain state as well.
- Different from causality in that changing one variable does not necessarily change the other.
- e.g. People with higher income tend to be happier, but giving people more money does not necessarily make them happier since other factors come into play.
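One common way to quantify such a tendency is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the function name `pearson_correlation` and the income and happiness figures are hypothetical, chosen only to mirror the example above. A high value says the two variables move together in this sample. It says nothing about whether changing one would change the other.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: income (arbitrary units) and a self-reported happiness score.
income = [20, 35, 50, 65, 80]
happiness = [4.0, 5.5, 6.0, 7.5, 7.0]

r = pearson_correlation(income, happiness)
print(f"correlation: {r:.2f}")
```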
Predict the Data
Based on data already collected, you predict data that has not been observed yet.
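A minimal sketch of prediction, assuming the data follow a roughly linear trend: fit a straight line by least squares and extend it one step. The helper `fit_line` and the `days` and `sales` values are hypothetical, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data already collected: sales observed on days 1 to 5.
days = [1, 2, 3, 4, 5]
sales = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(days, sales)

# Predict the value for day 6, which has not been observed.
predicted_day_6 = slope * 6 + intercept
print(f"predicted sales on day 6: {predicted_day_6:.2f}")
```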
Purpose of Statistics in Data Analysis
Dispersion is the degree of spread of the data.
The higher the dispersion, the more difficult it is to understand the data, because it brings uncertainty to the relationship between variables.
Observations that obey the laws of physics are relatively deterministic, and hence have low dispersion.
This uncertainty is the reason we need statistics.
Probability is the language of uncertainty.
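To make dispersion concrete, the sketch below compares two made-up samples that share the same mean but differ in spread, using the population standard deviation (`statistics.pstdev`) as one possible measure of dispersion. The sample names and values are hypothetical.

```python
from statistics import mean, pstdev

# Two hypothetical samples with the same mean but very different spread.
tight = [49, 50, 50, 50, 51]
spread = [10, 30, 50, 70, 90]

for name, sample in (("tight", tight), ("spread", spread)):
    # pstdev is the population standard deviation: the square root of the
    # average squared deviation from the mean.
    print(f"{name}: mean={mean(sample)}, std dev={pstdev(sample):.2f}")
```

Both samples have a mean of 50, but a single new observation is much harder to anticipate for the second one.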
Types of Statistics
Descriptive Statistics
Descriptive statistics is the type of statistics that summarizes and describes the data.
The main focus is on understanding and explaining the data itself.
Inferential Statistics
Inferential statistics aims to infer the characteristics of the source from which the data was collected.
To predict the data, we need to understand the source.
One way to do this is to come up with a probability model of the source.
There are two main types of inferential statistics:
- Statistical estimation
- Statistical testing
Statistical Estimation
e.g. Toss a coin 100 times and count the number of heads. The fraction of heads is then an estimate of the probability of getting heads.
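The coin example can be sketched as a small simulation. The function `estimate_heads_probability`, the fixed seed, and the value of `true_p` are assumptions made for illustration; in a real experiment `true_p` is unknown and only the flips are observed.

```python
import random

def estimate_heads_probability(flips):
    """Estimate P(heads) as the observed fraction of heads (True values)."""
    if not flips:
        raise ValueError("need at least one flip")
    return sum(flips) / len(flips)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_p = 0.5            # assumed probability, used only to simulate the coin

# Simulate 100 tosses; True means heads.
flips = [rng.random() < true_p for _ in range(100)]

p_hat = estimate_heads_probability(flips)
print(f"heads: {sum(flips)} / {len(flips)}, estimated probability: {p_hat:.2f}")
```

The estimate will usually differ somewhat from `true_p`, and that gap is the uncertainty statistics is meant to quantify.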
Statistical Testing
e.g. Hypothesize that the coin is fair. Then we can test the hypothesis by tossing the coin 100 times and checking whether the number of heads is close to 50.
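One way to make "close to 50" precise is an exact two-sided binomial test, sketched below with the standard library only. The helper names `binomial_pmf` and `two_sided_p_value` are hypothetical, and this is just one of several reasonable definitions of a two-sided p-value: it sums the probabilities of all outcomes that are no more likely than the observed one under the hypothesis.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, n, p=0.5):
    """Probability, under the hypothesis P(heads) = p, of an outcome
    at least as unlikely as the observed number of heads."""
    observed = binomial_pmf(heads, n, p)
    # The small relative tolerance guards against floating-point ties.
    threshold = observed * (1 + 1e-9)
    total = sum(
        binomial_pmf(k, n, p)
        for k in range(n + 1)
        if binomial_pmf(k, n, p) <= threshold
    )
    return min(1.0, total)

# Hypothetical outcomes of 100 tosses.
for heads in (50, 60, 70):
    print(f"{heads} heads: p-value = {two_sided_p_value(heads, 100):.4f}")
```

A small p-value means the observed count would be surprising if the coin were fair, which counts as evidence against the hypothesis. The cutoff for "small" (often 0.05) is a convention chosen by the analyst.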