Introduction to Statistical Analysis

Statistics plays a crucial role in data science as it provides the foundational principles and methods for analyzing and making sense of data. Data science relies heavily on statistical analysis to comprehend and model financial markets. Due to the inherent complexity and noise of financial data, statistical techniques are essential for gaining relevant insights and making well-informed decisions. Here are some key aspects of how statistics are used in data science in the context of financial markets:

  1. Descriptive Statistics: Descriptive statistics are a set of methods used to summarize and describe the main features of a dataset, such as its central tendency, variability, and distribution. These methods provide an overview of the data and help identify patterns and relationships.
  2. Inferential Statistics: Inferential statistics help data scientists make predictions or draw conclusions about a larger population based on a sample of data. Techniques like hypothesis testing, confidence intervals, and regression analysis fall under this category.
  3. Time Series Analysis: Time series analysis is a statistical method for analyzing data that is collected over time. It is used to understand the patterns and trends in the data and to make predictions about the future. Here are some examples of how time series analysis can be used in financial markets:
    • Traders might use time series analysis to identify patterns in the price movements of a stock. This information could then be used to make trading decisions.
    • Portfolio managers might use time series analysis to forecast the performance of different asset classes. This information could then be used to construct a diversified portfolio.
    • Risk managers might use time series analysis to assess the risk of different investments. This information could then be used to make decisions about how to allocate capital.
  4. Data Exploration: Statistics are used to explore data visually through techniques like histograms, scatter plots, box plots, and more. These visualizations provide insights into data distributions and relationships.
  5. Machine Learning: Many machine learning algorithms are built on statistical principles. For instance, linear regression and decision trees incorporate statistical concepts to make predictions.
  6. Bayesian Statistics: Bayesian methods are used for probabilistic modeling and updating beliefs as new data becomes available. Bayesian inference is used in various applications, including recommendation systems and natural language processing.
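As a small illustration of the time series item above, one simple way to look for a trend in a price series is a moving average. The text does not name a specific technique, so the moving average is an assumption chosen here for illustration. The sketch below uses only the Python standard library; the function name `moving_average` and the price values are hypothetical.

```python
def moving_average(values, window):
    """Return the simple moving average for each full window in `values`."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily closing prices (illustrative only, not real market data).
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]

# A 3-day moving average smooths day-to-day noise so the trend is easier to see.
sma_3 = moving_average(closes, window=3)
print(sma_3)
```

A longer window smooths more but reacts more slowly to changes, so the window length is a judgment call rather than a fixed rule.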

In summary, statistics provides the mathematical framework for data scientists to explore, analyze, and draw meaningful insights from data. It is an essential tool in the data science toolbox, helping professionals make data-driven decisions and solve real-world problems.

Some of the common statistical terms widely used in the financial world

  1. Mean: It is the most common measure of central tendency and is calculated by summing all the values in a dataset and dividing by the number of values. It can be used to summarize the performance of a stock, a sector, or the entire market over a period of time. Here are some examples of how the mean can be used in financial markets:
    • Traders might use the mean to calculate the average price of a stock over a period of time, such as a week or a month. This information could then be used to identify trends in the stock’s price and make trading decisions.
    • Portfolio managers might use the mean to calculate the average return of a portfolio of stocks over a period of time. This information could then be used to evaluate the performance of the portfolio and make decisions about how to allocate capital.
    • Risk managers might use the mean to calculate the average volatility of a stock or a portfolio of stocks over a period of time. This information could then be used to assess the risk of the investment and make decisions about how to hedge against risk.
  2. Median: The median is another important descriptive statistic used in financial markets. The median is the middle value in a set of data that has been sorted in ascending or descending order. If there are two middle values, the median is the mean of those two values. The median is less sensitive to outliers than the mean, which is why it is often preferred when analyzing financial data. Outliers are extreme values that can skew the results of the mean. For example, if a few stocks in a portfolio have very high returns, the mean return of the portfolio will be skewed high. However, the median return of the portfolio will be less affected by these outliers.
  3. Standard Deviation: The standard deviation measures the dispersion or volatility of data points around the mean. In finance, it is a critical measure of risk. Higher standard deviation implies higher price volatility, which can be both an opportunity and a risk for investors.
    • About 68% of values in a normal distribution fall within one standard deviation of the mean.
    • About 95% of values fall within two standard deviations of the mean.
    • About 99.7% of values fall within three standard deviations of the mean.
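The three measures above can be computed with Python's standard `statistics` module. The following is a minimal sketch, not a production implementation: the return values are hypothetical, and the second half checks the 68%, 95%, and 99.7% figures on simulated normally distributed data rather than on real market returns, which often are not normal.

```python
import random
import statistics

# Hypothetical daily returns in percent; the last value is a deliberate outlier.
returns = [0.5, -0.2, 0.3, 0.1, -0.4, 0.2, 12.0]

mean_return = statistics.mean(returns)      # pulled upward by the outlier
median_return = statistics.median(returns)  # middle value, barely affected
std_return = statistics.stdev(returns)      # sample standard deviation (n - 1)

print(f"mean={mean_return:.3f} median={median_return:.3f} stdev={std_return:.3f}")

# Check the percentage figures on simulated, normally distributed data.
rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
mu = statistics.fmean(sample)
sigma = statistics.pstdev(sample)


def share_within(k):
    """Fraction of the sample lying within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in sample) / len(sample)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The single outlier moves the mean well above the median, which is the sensitivity described in the median item above.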

4. Variance: Variance is another measure of dispersion. It is the average of the squared deviations of each value from the mean, and the standard deviation is its square root. To compute it, calculate the mean, square each value's deviation from the mean, and then average the squared deviations.
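The variance calculation can be sketched step by step as follows. The prices are hypothetical. The sketch shows both the population form (dividing by n) and the sample form (dividing by n - 1), since the text does not say which one is intended.

```python
import math
import statistics

# Hypothetical closing prices (illustrative only).
prices = [101.0, 103.0, 99.0, 105.0, 102.0]

# Step 1: the mean.
mean_price = sum(prices) / len(prices)

# Step 2: squared deviation of each value from the mean.
squared_deviations = [(p - mean_price) ** 2 for p in prices]

# Step 3: average the squared deviations.
population_variance = sum(squared_deviations) / len(prices)    # divide by n
sample_variance = sum(squared_deviations) / (len(prices) - 1)  # divide by n - 1

# The standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)

print(population_variance, sample_variance, population_std)

# The standard library gives the same results.
assert math.isclose(population_variance, statistics.pvariance(prices))
assert math.isclose(sample_variance, statistics.variance(prices))
```

The sample form is the usual choice when the data are a sample drawn from a larger population, such as a limited window of price history.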

5. Skewness: Skewness is a commonly used measure in descriptive statistics that characterizes the asymmetry of a data distribution. In data science work, it helps to understand the shape of the data, identify outliers, and see where most of the observations are concentrated.

A longer tail on the right side of the distribution is indicated by positive skewness, whereas a longer tail on the left is indicated by negative skewness. Understanding a dataset’s shape and outliers is made easier by skewness.
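The sign convention can be checked with a short calculation. The text does not say which skewness formula it has in mind, so this sketch assumes the common moment-based (population) coefficient: the mean cubed deviation divided by the cube of the standard deviation. The function name `skewness` and the sample values are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based (population) skewness: mean cubed deviation / std ** 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 15]  # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(f"right-tailed: {skewness(right_tailed):+.3f}")
print(f"left-tailed:  {skewness(left_tailed):+.3f}")
print(f"symmetric:    {skewness(symmetric):+.3f}")

# In the right-tailed sample the mean sits above the median.
print(statistics.mean(right_tailed) > statistics.median(right_tailed))
```

Statistical libraries may apply a small-sample bias correction, so their values can differ slightly from this uncorrected form.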

If the values of an independent variable (feature) are skewed, then depending on the model this can violate model assumptions or distort the interpretation of feature importance.

Skewness measures the asymmetry of a probability distribution, in contrast to the symmetrical normal distribution (bell curve), whose skewness is zero.

In a skewed dataset, typical values are those that range from the first quartile (Q1) to the third quartile (Q3).

The normal distribution serves as the reference point for skewness. Normally distributed data are symmetric: because all measures of central tendency (mean, median, and mode) coincide at the center, the distribution has zero skewness.

Both the left and right sides of symmetrically distributed data have an equal number of observations. For example, if the dataset contains 90 values, 45 observations lie on the left side and 45 on the right. But what if the distribution is not symmetrical? This type of data is referred to as asymmetrical data.

Types of Skewness

Positively Skewed or Right-Skewed (Positive Skewness): A positively skewed distribution in statistics has a long right tail. Unlike symmetrically distributed data, where the mean, median, and mode are all equal, this type of distribution has measures of central tendency that are spread apart. A distribution is said to be positively skewed if its skewness coefficient is positive rather than zero or negative.

Positively skewed data typically have a mean that is higher than the median: most observations cluster on the left, and a few large values stretch the tail to the right. Since the mode is the most frequent value and the median is the middle value, the usual ordering is mode, then median, then mean, because the large values in the right tail pull the mean upward more than they move the median.

Extreme positive skewness can be undesirable because many statistical methods assume roughly symmetric data and may give misleading results otherwise. Skewed data can be brought closer to a normal distribution with the aid of data transformation techniques. The most well-known transformation for positively skewed distributions is the log transformation, which replaces each value in the dataset with its natural logarithm. It applies only to strictly positive values.
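The effect of the log transformation can be sketched on a small positively skewed sample. The trade sizes below are hypothetical, and `skewness` is an illustrative helper using the moment-based (population) formula, which is an assumption since the text does not name a formula.

```python
import math


def skewness(values):
    """Moment-based (population) skewness: mean cubed deviation / std ** 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical trade sizes: strictly positive, with a long right tail.
trade_sizes = [10, 12, 15, 18, 20, 25, 30, 45, 80, 400]

# math.log is the natural logarithm; it raises ValueError for values <= 0.
log_sizes = [math.log(v) for v in trade_sizes]

skew_before = skewness(trade_sizes)
skew_after = skewness(log_sizes)
print(f"skewness before log: {skew_before:.3f}")
print(f"skewness after log:  {skew_after:.3f}")
```

In this sample the transformation reduces the skewness but does not remove it, so it is worth re-checking the shape of the data after transforming.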

Negatively Skewed or Left-Skewed (Negative Skewness): The opposite of a positively skewed distribution is a left-skewed distribution, which has a long left tail. In statistics, a negatively skewed distribution is one in which the majority of data are plotted on the graph's right side while the distribution's tail spreads out to the left. When data are negatively skewed, the mean is typically lower than the median, because a few small values in the left tail pull the mean downward. A distribution is said to be negatively skewed if its skewness coefficient is negative rather than positive or zero.

The mode is the most frequently occurring value, whereas the median is the middle value. In a negatively skewed distribution, the median is typically greater than the mean.

A good rule of thumb is that the data are nearly symmetrical if the skewness is between -0.5 and 0.5. The data are moderately skewed if the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed). The data are considered highly skewed if the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed).
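The rule of thumb above maps directly onto a small helper. This is a sketch only: the function name `describe_skewness` is hypothetical, the middle band is labeled "moderately" here, and the treatment of the exact boundary values (0.5 and 1) is an assumption, since the text does not say which band they belong to.

```python
def describe_skewness(skew):
    """Label a skewness value using the -1 / -0.5 / 0.5 / 1 rule of thumb."""
    magnitude = abs(skew)
    # Assumption: boundary values fall into the milder band.
    if magnitude <= 0.5:
        return "nearly symmetrical"
    direction = "positively" if skew > 0 else "negatively"
    if magnitude <= 1.0:
        return f"moderately {direction} skewed"
    return f"highly {direction} skewed"


for value in (0.2, -0.7, 0.9, 1.8, -2.4):
    print(f"{value:+.1f}: {describe_skewness(value)}")
```

These cut-offs are conventions rather than statistical tests, so they are best read as a quick screen.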

6. Kurtosis: Kurtosis is a statistical term that characterizes the shape of a probability distribution. When compared to a normal distribution, it provides information about the distribution's tails and peak. Kurtosis lower than that of the normal distribution suggests lighter tails and a flatter distribution, while higher kurtosis suggests heavier tails and a more peaked distribution. Kurtosis aids in the analysis of a dataset's properties and outliers.

Kurtosis measures the tailedness of a distribution, that is, how often outliers occur. The peakedness of a data distribution is determined by the degree to which data values are clustered around the mean. High kurtosis datasets typically have heavy tails, a sharp peak close to the mean, and a quick drop-off. In contrast, low kurtosis datasets usually have a flat top near the mean instead of a sharp peak.

Kurtosis in Finance:

Kurtosis is used in finance as a measure of financial risk. High kurtosis suggests a high likelihood of extremely large positive or negative returns, which is associated with a high level of risk for an investment. In contrast, low kurtosis indicates a more moderate level of risk because the likelihood of extreme returns is relatively low.

Excess Kurtosis

In probability and statistics, excess kurtosis compares a distribution's kurtosis coefficient with that of a normal distribution. Since the kurtosis of a normal distribution is 3, excess kurtosis is computed by subtracting 3 from the kurtosis. Leptokurtic distributions have positive excess kurtosis, platykurtic distributions have negative excess kurtosis, and mesokurtic distributions have excess kurtosis close to zero.
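The subtraction of 3 can be sketched as follows. This assumes the moment-based (population) kurtosis, that is, the fourth central moment divided by the squared variance. The function name `excess_kurtosis` and all sample values are hypothetical, and the normal-like sample is simulated.

```python
import random


def excess_kurtosis(values):
    """Moment-based (population) kurtosis minus 3, the normal-distribution value."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3.0


# Hypothetical samples (illustrative only).
flat = list(range(1, 11))                        # evenly spread, no tails
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]  # mostly calm, two extremes

rng = random.Random(7)  # fixed seed so the run is repeatable
normal_like = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

print(f"flat (platykurtic):         {excess_kurtosis(flat):+.3f}")
print(f"heavy-tailed (leptokurtic): {excess_kurtosis(heavy_tailed):+.3f}")
print(f"normal-like (mesokurtic):   {excess_kurtosis(normal_like):+.3f}")
```

Some libraries report excess kurtosis by default and others report raw kurtosis, so it is worth checking which convention a given function uses before comparing values.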

Types of Kurtosis

  • Leptokurtic or Heavy-Tailed Distribution (kurtosis greater than that of a normal distribution).
  • Mesokurtic (kurtosis the same as that of a normal distribution).
  • Platykurtic or Short-Tailed Distribution (kurtosis less than that of a normal distribution).