Goals and Target of Data Analysis
Goals of Data Analysis
Three main goals of data analysis:
- Summarize the data: e.g. calculate the mean
- Describe the data
- Predict the data
Describe the Data
Describing the data means understanding its characteristics and the relationships between its variables.
Questions like “Was this effective?” or “Is there a relationship between A and B?” fit here.
In data analysis, there are two main types of relationships: causality and correlation.
Causality
- Relationship of cause and effect: if-then
- If you know the cause, you can not only predict the effect, but also control the outcome by intervening on the cause.
- e.g. If you take this treatment, you will get better.
Correlation
- Relationship of tendency: when one variable is in a certain state, the other tends to be in a certain state.
- Different from causality in that changing one variable does not necessarily change the other.
- e.g. People with higher income tend to be happier, but giving people more money does not necessarily make them happier since other factors come into play.
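The difference between correlation and causality can be sketched with a small simulation. Everything here is an assumption made for illustration: a hypothetical hidden factor drives both `income` and `happiness`, so the two are correlated even though neither causes the other. When income is assigned at random (an intervention), the correlation disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor that influences both variables.
hidden = [rng.gauss(0, 1) for _ in range(5000)]
income = [h + rng.gauss(0, 1) for h in hidden]
happiness = [h + rng.gauss(0, 1) for h in hidden]

# Observational data: income and happiness tend to move together.
r_observed = pearson(income, happiness)

# Intervention: income is handed out at random, ignoring the hidden factor.
# Happiness still depends only on the hidden factor, so it does not follow.
income_assigned = [rng.gauss(0, 1) for _ in hidden]
r_intervened = pearson(income_assigned, happiness)

print(f"observed r   = {r_observed:.2f}")
print(f"intervened r = {r_intervened:.2f}")
```

In this toy model the observed correlation is clearly positive, while the correlation after the intervention is close to zero. That is the sense in which changing one variable does not necessarily change the other.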
Purpose of Statistics in Data Analysis
Dispersion is the degree of spread of the data.
The higher the dispersion, the harder it is to understand the data, because dispersion adds uncertainty to the relationships between variables. Observations that obey the laws of physics are relatively deterministic and therefore have low dispersion.
This uncertainty is the reason we need statistics, and probability is the language of uncertainty.
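As a rough illustration, the sketch below compares two made-up data sets that share the same mean but differ in spread. The numbers are invented for this example, and variance and standard deviation are only two of several common dispersion measures.

```python
import statistics

# Two invented sets of measurements with the same mean.
tight = [9.8, 9.9, 10.0, 10.1, 10.2]
spread = [2.0, 6.0, 10.0, 14.0, 18.0]

for name, data in [("tight", tight), ("spread", spread)]:
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance
    stdev = statistics.pstdev(data)        # population standard deviation
    print(f"{name}: mean={mean:.2f} variance={variance:.2f} stdev={stdev:.2f}")
```

The mean alone cannot tell the two data sets apart. The dispersion measures show that a single value drawn from `spread` is much less predictable than one drawn from `tight`.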
Types of Statistics
Descriptive Statistics
The main focus is on understanding and explaining the data itself.
- Summarize the data
- Describe the data
Inferential Statistics
Aims to infer the characteristics of the source from which the data was collected.
To predict the data, we need to understand the source.
One way to do this is to come up with a probability model of the source.
There are two main types of inferential statistics:
- Statistical inference
- Statistical testing
Statistical Inference
e.g. Toss a coin 100 times and count the number of heads. The observed proportion of heads is then an estimate of the probability of getting heads.
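A minimal sketch of the coin example follows. It assumes a simulated coin whose true probability `TRUE_P` is known to the program only so that tosses can be generated; in a real analysis that value is exactly what is unknown. The standard error uses the usual normal approximation for a proportion.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

TRUE_P = 0.5  # assumed for the simulation; unknown in a real analysis
n = 100       # number of tosses

tosses = [rng.random() < TRUE_P for _ in range(n)]  # True means heads
heads = sum(tosses)

# Point estimate of P(heads): the observed proportion of heads.
p_hat = heads / n

# Approximate standard error of the estimate.
std_error = (p_hat * (1 - p_hat) / n) ** 0.5

print(f"heads={heads}, estimate={p_hat:.2f}, standard error={std_error:.3f}")
```

The estimate will rarely equal the true value exactly. The standard error gives a sense of how far off it is likely to be, and it shrinks as the number of tosses grows.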
Statistical Testing
e.g. Hypothesize that the coin is fair. Then test the hypothesis by tossing the coin 100 times and checking whether the number of heads is close enough to 50 to be consistent with a fair coin.
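One way to make "close to 50" precise is an exact two-sided binomial test, sketched below. The function name `fair_coin_p_value` is hypothetical, and the 0.05 threshold mentioned afterwards is a common convention rather than something fixed by the text above.

```python
from math import comb


def fair_coin_p_value(heads, n):
    """Two-sided exact binomial p-value under the hypothesis P(heads) = 0.5."""
    expected = n / 2
    distance = abs(heads - expected)
    # Count every outcome at least as far from n/2 as the observed one.
    # With a fair coin, each of the 2**n toss sequences is equally likely.
    extreme = sum(
        comb(n, k) for k in range(n + 1) if abs(k - expected) >= distance
    )
    return extreme / 2 ** n


for observed in (50, 60, 65):
    p = fair_coin_p_value(observed, 100)
    print(f"{observed} heads out of 100: p-value = {p:.4f}")
```

A small p-value means the observed count would be unusual for a fair coin. If it falls below a chosen threshold such as 0.05, the hypothesis of a fair coin is rejected; otherwise the data are considered consistent with it.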
Target of Analysis
Before you start analyzing any data, clearly define:
- The goal of analysis
- The target of analysis
As described above, the goal may be to summarize, describe, or predict the data.
Once the goal is set, you need to define the target of analysis, also known as the population.
e.g. If the goal is to confirm the effect of a certain treatment, then the target of analysis would be the entire population of people who have the related disease.
Population
The target of the analysis is called the population.
The number of elements in the population is called the population size.
There are two types of population depending on the size:
Finite population
No matter how large the population is, if its elements can be fully enumerated (that is, its size is a finite number), then it is a finite population.
Infinite population
If the population has infinitely many elements, then it is an infinite population.
A population that changes over time is also treated as an infinite population.
Finding the characteristics of the population
If we know the characteristics of the population, then it becomes easier for us to predict the data.
Then how would we know the characteristics of the population?
Complete enumeration
- Possible only for finite populations
- Descriptive statistics can do the job
- Often impractical for large populations because of the cost and time required
Sample survey
- Sampling is the process of selecting a subset of the population
- Inferential statistics will be used to infer the characteristics of the population from the sample
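The two approaches can be compared in a short sketch. The population here is synthetic (100,000 invented values), which is the only reason complete enumeration is cheap. The sample is drawn by simple random sampling without replacement, one of several possible sampling schemes.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Synthetic finite population, invented for this example.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Complete enumeration: use every element of the population.
population_mean = statistics.fmean(population)

# Sample survey: look at only n elements and estimate from them.
n = 500
sample = rng.sample(population, n)  # simple random sample, no replacement
sample_mean = statistics.fmean(sample)

print(f"population mean (complete enumeration): {population_mean:.2f}")
print(f"sample mean (n={n}):                    {sample_mean:.2f}")
```

The sample mean is not identical to the population mean, but it is usually close while using only a small fraction of the data.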
Sample size
The number of elements in the sample is called the sample size.
The sample size is usually denoted by:
$$ n $$
It is important to distinguish the number of samples from the sample size. The number of samples is the number of times you perform the sampling process, while the sample size is the number of elements in each sample.
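The distinction is easy to see in code. In this sketch the population and the names `sample_size` and `number_of_samples` are made up for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

population = list(range(1, 1001))  # hypothetical population of 1,000 elements

sample_size = 20       # n: number of elements in each sample
number_of_samples = 5  # how many times the sampling process is repeated

samples = [rng.sample(population, sample_size) for _ in range(number_of_samples)]

print(f"number of samples: {len(samples)}")
print(f"sample size:       {len(samples[0])}")
```

Here `samples` holds 5 samples, each with a sample size of 20, for 100 observations in total.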