Guide to Time-Series Analysis in Python

One of the many ways software engineers add value to an organization is by performing time-series analysis. This powerful technique extracts valuable insights from temporal data by analyzing time-based patterns and making predictions from them.

In this blog post, we will delve into the world of time-series analysis using Python, often considered the go-to programming language for data analysis. Python offers a rich ecosystem of libraries and tools, making it an ideal choice for working with time-series data.

However, using Python with a powerful time-series database like Timescale can speed up and simplify your data analysis. See our Python quick start to leverage Timescale's fast queries, performance, and features, or keep reading for more info and a step-by-step guide.

Now, back to Python.

An Example of Time-Series Analysis With Python

Python has quickly emerged as a preferred tool for data analysis due to its simplicity, versatility, and vast community support. With its intuitive syntax and extensive library ecosystem, this elegant programming language allows you to tackle complex problems efficiently.

Whether you are building a data-intensive application or working with an experienced data scientist, Python provides a robust platform for exploring, visualizing, and modeling time-dependent data.

Let's see how Python can empower your work with time-series data. Consider the following example code snippet that loads a time-series dataset using pandas and plots it using Matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generate random time-series data
np.random.seed(42)
dates = pd.date_range(start='2022-01-01', periods=100, freq='D')
values = np.random.randn(100).cumsum()

# Create a DataFrame from the generated data
data = pd.DataFrame({'date': dates, 'value': values})

# Set the 'date' column as the index
data.set_index('date', inplace=True)

# Plot the time-series data
plt.plot(data.index, data['value'])
plt.xlabel('Time')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.title('Time Series Data')
plt.show()

In this example, a random time-series dataset is generated using NumPy's random number generator. The dataset consists of 100 dates, starting from January 1, 2022, and corresponding random values. The data is then converted into a Pandas DataFrame, and the 'date' column is set as the index. Finally, the time-series data is plotted using Matplotlib, displaying the variation of the 'value' over time.


Why Use Python for Time-Series Data Analysis?

Python brings a host of benefits to the table when it comes to time-series analysis:

  • It is a user-friendly language.
  • It is widely available in the open-source world.
  • It has extensive library support.
  • It facilitates code reuse.

Let's dig into these advantages.

Python is easy to use

Python is known for its simplicity and user-friendliness. It has an intuitive syntax that makes it easy to learn, even for beginners. The clean structure of Python code promotes efficient coding practices, allowing you to focus on analyzing time-series data rather than grappling with complex programming concepts.

Python is open source

One of the great advantages of Python is that it's an open-source language. This means it is freely available to use and is continuously improved and supported by a vibrant community of developers. The open-source nature of Python enables data scientists to access a wealth of resources, tools, and libraries for analyzing time-series data without incurring additional costs.

Python offers extensive library support

Python offers an extensive collection of specialized libraries and tools specifically designed for time-series analysis. These libraries, such as pandas, NumPy, statsmodels, and scikit-learn, provide various functions and tools tailored to the unique challenges of working with time-dependent data. They simplify complex operations, allowing you to focus on extracting meaningful insights rather than reinventing the wheel.

Python facilitates code reusability

Thanks to its longevity and widespread adoption, Python has a vast codebase that data scientists and application developers can leverage for their time-series analysis needs.

Many common tasks, such as data loading, cleaning, transformation, and visualization, have already been implemented and shared by the Python community. This allows you to save time and effort by building upon existing code and solutions, accelerating the analysis process.

Plotting Data Using Pyplot

Plotting time-series data is an essential step in visualizing patterns, trends, and anomalies. Python provides the Matplotlib library, which includes the Pyplot module for creating various types of plots, including line plots, scatter plots, and histograms.

To illustrate this, let's create a random dataset and plot it using Pyplot:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate random time-series data
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=100)
values = np.random.randn(100).cumsum()

# Plot the time-series data
plt.plot(dates, values)
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.title('Time Series Data')
plt.show()

Time-Series Analysis Tasks in Python

Performing time-series analysis involves examining historical data to uncover patterns, trends, and other valuable insights. It is a crucial step in understanding the behavior of time-dependent data and making predictions for the future. Time-series analysis encompasses numerous techniques, such as trend analysis, seasonality detection, forecasting, and anomaly detection.

✨
Editor's Note: If you want to learn how you can use Timescale to analyze historical data for speedy client-facing analytics dashboards, read Octave's story (whose team built most of their backend software using Python).



In Python, there are various techniques available to analyze data for trends and patterns. These techniques enable data scientists and developers to gain valuable insights into the underlying characteristics of their datasets. Understanding data trends and patterns is crucial for making informed decisions and predictions based on the available information.

Stationarity

Stationarity is a key concept in time-series analysis: a series is stationary when its statistical properties, such as mean and variance, remain constant over time. In Python, you can test for stationarity with methods like the Augmented Dickey-Fuller (ADF) test, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, and visual inspection of time-series plots. Here's an example of how to test for stationarity in Python using the ADF test:

from statsmodels.tsa.stattools import adfuller

# Assuming 'data' is the time series data
result = adfuller(data)
print('ADF Statistic:', result[0])
print('p-value:', result[1])

Seasonality

Seasonality pertains to recurring patterns or fluctuations in a time series that occur at regular intervals. It introduces predictable variations in the data over specific periods. Distinguishing between seasonality and stationarity is essential, as they represent different aspects of time series behavior.

Testing for seasonality in Python can be accomplished through techniques such as decomposition analysis and autocorrelation function (ACF) plots. An example of testing for seasonality involves decomposing the time series and analyzing the seasonal component visually.
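
For instance, here's a minimal sketch of decomposition using statsmodels' seasonal_decompose, assuming 'data' is a pandas Series with a DatetimeIndex (the period value is an assumption to adjust for your data's cycle):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Assuming 'data' is a pandas Series with a DatetimeIndex
decomposition = seasonal_decompose(data, model='additive', period=7)  # period=7 assumes a weekly cycle in daily data
decomposition.plot()  # trend, seasonal, and residual components
plt.show()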

Autocorrelation and partial autocorrelation

Autocorrelation measures the relationship between a variable's current value and its past values at different time lags. On the other hand, partial autocorrelation quantifies the direct relationship between a variable's current value and its past values, excluding the influence of intermediate-lagged variables.

In Python, testing for autocorrelation and partial autocorrelation often involves plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) and observing the patterns. Here's an example of how to visualize autocorrelation and partial autocorrelation in Python:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Assuming 'data' is the time series data
plot_acf(data)
plot_pacf(data)
plt.show()

Predicting future values based on historical data

Python offers a variety of libraries and techniques for time-series forecasting, and one popular method is the autoregressive integrated moving average (ARIMA) model. ARIMA is a powerful and widely used approach that combines the three following components to capture the patterns and trends in time-series data:

1. Autoregression (AR)

2. Differencing (I)

3. Moving Average (MA)

You can apply the ARIMA model in Python using the statsmodels.tsa.arima.model.ARIMA class. This class allows you to specify the order of the AR, I, and MA components and fit the model to your historical data. Once the model is fitted, you can forecast future values by calling the predict method and specifying the start and end dates for the forecast period.
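
As a rough sketch (the order values and forecast dates are illustrative assumptions, with 'data' as a pandas Series indexed by date):

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'data' is a pandas Series with a DatetimeIndex
model = ARIMA(data, order=(1, 1, 1))  # illustrative order; choose p, d, q for your data
fitted = model.fit()
# Forecast a hypothetical ten-day window by date
forecast = fitted.predict(start='2022-04-11', end='2022-04-20')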

Types of forecasting models

In time-series analysis, various forecasting models are available to predict future values based on historical data. Each model has its own strengths, limitations, and suitability for different types of time-series data. Let's explore some common types of forecasting models:

Moving Average (MA)

The moving average model calculates the average of past observations to forecast future values. It helps eliminate short-term fluctuations and identify underlying trends in the data.

The moving average model can be implemented using the rolling function in pandas, which calculates the mean over a specified window of past observations. Here's a simplified example:

import pandas as pd

# Assuming 'data' is the time series data
window_size = 3
moving_avg = data.rolling(window=window_size).mean()

Autoregressive (AR)

The autoregressive model predicts future values with a linear regression over past observations. It assumes that future values depend linearly on previous values at one or more lags.

The autoregressive model can be implemented using the AutoReg class from the statsmodels library (the successor to the older AR class), which fits an autoregressive model to the time series data. Here's an example:

from statsmodels.tsa.ar_model import AutoReg

# Assuming 'data' is the time series data
model = AutoReg(data, lags=1)  # lags=1 is illustrative; choose a lag order that fits your data
ar_model = model.fit()
predictions = ar_model.predict(start=len(data), end=len(data)+n)  # Replace n with the number of future values to predict

Autoregressive Moving Average (ARMA)

The ARMA model combines the autoregressive and moving average models, considering both past observations and the average of past errors to make predictions.

The autoregressive moving average model can be implemented with the ARIMA class from the statsmodels library by setting the differencing order d to zero (the standalone ARMA class has been removed from recent statsmodels versions). Here's an example:

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'data' is the time series data
model = ARIMA(data, order=(p, 0, q))  # Replace p and q with appropriate values; d=0 yields an ARMA model
arma_model = model.fit()
predictions = arma_model.predict(start=len(data), end=len(data)+n)  # Replace n with the number of future values to predict

Autoregressive Integrated Moving Average (ARIMA)

The ARIMA model extends the ARMA model by incorporating differencing to make the time series stationary. It is suitable for non-stationary data with trends and seasonality.

The autoregressive integrated moving average model can also be implemented using the ARIMA class from the statsmodels library. Here's an example:

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'data' is the time series data
model = ARIMA(data, order=(p, d, q))  # Replace p, d, and q with appropriate values
arima_model = model.fit()
predictions = arima_model.predict(start=len(data), end=len(data)+n)  # Replace n with the number of future values to predict

Exponential Smoothing

Exponential smoothing models apply weights to past observations, giving more importance to recent values. Different variations of exponential smoothing, such as Simple Exponential Smoothing (SES), Holt's Linear Exponential Smoothing, and Holt-Winters Exponential Smoothing, accommodate different patterns in the data.

Exponential smoothing models can be implemented using the SimpleExpSmoothing, ExponentialSmoothing, and Holt classes from the statsmodels library. Here's an example of Simple Exponential Smoothing (SES):


from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Assuming 'data' is the time series data
model = SimpleExpSmoothing(data)
ses_model = model.fit()
predictions = ses_model.forecast(steps=n)  # Replace n with the number of future values to forecast
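
For data with both trend and seasonality, here's a minimal sketch of Holt-Winters exponential smoothing (the seasonal_periods value is an assumption to match your data's cycle):

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assuming 'data' is a time series covering at least two full seasonal cycles
model = ExponentialSmoothing(data, trend='add', seasonal='add', seasonal_periods=12)
hw_model = model.fit()
predictions = hw_model.forecast(steps=n)  # Replace n with the number of future values to forecast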

Seasonal ARIMA (SARIMA)

SARIMA is an extension of the ARIMA model that accounts for seasonal patterns in the data. It includes additional terms to capture seasonality, making it suitable for time-series data with recurring patterns.

The seasonal ARIMA model can be implemented similarly to the ARIMA model but with additional seasonal parameters. Here's a simplified example:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assuming 'data' is the time series data
model = SARIMAX(data, order=(p, d, q), seasonal_order=(P, D, Q, s))  # Replace p, d, q, P, D, Q, and s with appropriate values
sarima_model = model.fit()
predictions = sarima_model.forecast(steps=n)  # Replace n with the number of future values to forecast


These are just a few examples of forecasting models commonly used in time-series analysis. To explore more details about each model, including their mathematical formulations, strengths, limitations, and suitable use cases, read our blog post, What Is Time-Series Forecasting?

Extracting useful features for machine learning/deep learning algorithms

Extracting meaningful features from time-series data is also crucial for training machine learning or deep learning models. These features serve as inputs to the models and help capture relevant patterns and characteristics in the data. Python provides various techniques and libraries to extract useful features for time-series analysis.

One common approach is to apply feature engineering techniques to transform raw time-series data into informative features. Let's consider an example of training a machine learning model to detect potential risk for a heart attack based on heart rate data. Time-series analysis techniques can help us extract meaningful insights from the heart rate data that can be used as inputs to the model.

For instance, we can calculate statistical measures such as the mean and standard deviation of the heart rate. These measures provide information about the central tendency and variability of the heart rate data, respectively.

Additionally, we can compute other features like autocorrelation, which measures the correlation between the heart rate values at different time lags. By feeding these features into a machine learning model, we can make accurate predictions on the risk level for potential heart attacks.
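
As a minimal sketch, assuming 'heart_rate' is a pandas Series of measurements (the name is illustrative):

import pandas as pd

# Assuming 'heart_rate' is a pandas Series of heart rate measurements
features = {
    'mean': heart_rate.mean(),                    # central tendency
    'std': heart_rate.std(),                      # variability
    'autocorr_lag1': heart_rate.autocorr(lag=1),  # correlation with the previous reading
}
feature_row = pd.DataFrame([features])  # one row of inputs for a model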

Data cleaning

Data cleaning plays a crucial role in the analysis of time series as it ensures the accuracy and reliability of the data used for further analysis and modeling. In Python, you can leverage various techniques and libraries to clean time-series data and handle common issues, such as missing values, outliers, or inconsistencies.

Data cleaning in time-series analysis typically involves the following steps; a minimal pandas sketch follows the list:

  1. Handling missing values: Missing values can occur in time-series data due to various reasons, such as sensor failures, data transmission issues, or human errors. Python provides libraries like pandas that offer methods to handle missing values, such as interpolation, forward filling, backward filling, or dropping rows with missing values.
  2. Outlier detection and treatment: Outliers are extreme values that deviate significantly from the normal patterns in the time series. Identifying and handling outliers is important to avoid distortions in the analysis. Python libraries like pandas, NumPy, or scikit-learn provide techniques to detect and handle outliers, such as statistical methods or machine learning-based approaches.
  3. Dealing with inconsistent or incorrect data: Time-series data may sometimes contain inconsistent or incorrect values, such as inconsistent units, invalid data types, or data entry errors. Python offers functionalities to clean and correct such data inconsistencies, including data type conversion, data normalization, or applying business rules to identify and correct erroneous data.

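Here's a sketch of steps 1 and 2, assuming 'data' is a pandas Series with a DatetimeIndex:

import numpy as np

# Assuming 'data' is a pandas Series with a DatetimeIndex
data = data.interpolate(method='time')  # fill missing values by time-based interpolation

# Drop outliers more than three standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
data = data[np.abs(z_scores) < 3]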

By performing these data cleaning steps, you can ensure the quality and reliability of the time-series data for accurate and meaningful analysis. Another way to do this is by using Timescale. 😎 In one of our previous blog posts, we made a side-by-side comparison between Timescale (built on PostgreSQL) and Python for data cleaning.

Challenges in Working With Time-Series Data in Python

Still, working with time-series data in Python can pose some challenges, especially when dealing with large datasets. In this section, we will explore two common challenges in working with time series and discuss strategies to overcome them.

Loading data quickly and efficiently

Loading large time-series datasets efficiently is crucial for smooth data analysis. Python provides several libraries like pandas and NumPy that offer efficient data structures and tools for handling time-series data.

To load data quickly, consider using pandas' read_csv function with optimized parameters. For example, specifying the appropriate data type for each column can significantly speed up the loading process. Additionally, compressed or columnar formats such as gzip-compressed CSV or Parquet can reduce file size and improve loading performance.
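
For example, here's a sketch of a tuned read_csv call (the file name, column names, and dtypes are assumptions for illustration):

import pandas as pd

# File name, column names, and dtypes are hypothetical; match them to your data
data = pd.read_csv(
    'data.csv.gz',  # gzip-compressed input is decompressed automatically
    dtype={'sensor_id': 'int32', 'value': 'float32'},
    parse_dates=['timestamp'],  # parse the time column while loading
)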

Another approach to enhance data loading speed is to leverage parallel processing. With multiprocessing, the analysis is split across multiple CPU cores and the results are combined in a final step. Libraries like Dask and Apache Spark go further, providing distributed computing capabilities that spread the load across multiple machines, accelerating the data loading and analysis process.
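
Here's a minimal Dask sketch, which mirrors the pandas API while splitting work across cores or machines (the file pattern and column names are assumptions):

import dask.dataframe as dd

# Lazily read many CSV files and aggregate in parallel
df = dd.read_csv('data/*.csv', parse_dates=['timestamp'])
df['day'] = df['timestamp'].dt.date
daily_mean = df.groupby('day')['value'].mean().compute()  # triggers the parallel computation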

Handling large datasets

When dealing with gigabytes (or terabytes) of time-series data, parsing and processing all of it can become challenging. In such cases, it is essential to have a strategy to handle large datasets efficiently.

You can use multiprocessing techniques, as mentioned earlier. By splitting the analysis across multiple cores or machines and combining the results afterward, you distribute the workload and process large datasets more quickly.

Another option is to leverage distributed computing frameworks like Apache Spark. Spark allows you to spread the data processing across a cluster of machines, enabling efficient handling of large-scale time-series datasets. Spark's parallel processing capabilities and built-in data processing functions make it a powerful tool for managing and analyzing big time-series data.
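
As a minimal PySpark sketch (the file pattern and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('timeseries').getOrCreate()

# Read a large CSV dataset and aggregate by day; path and columns are hypothetical
df = spark.read.csv('data/*.csv', header=True, inferSchema=True)
daily = df.groupBy(F.to_date('timestamp').alias('day')).agg(F.avg('value').alias('avg_value'))
daily.show()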

Working With Time Series in Python

Working with time-series data in Python involves several key steps, from choosing the right time-series library to loading and analyzing the data. Letā€™s explore the essential aspects of working with time series in Python, such as selecting a time-series library, utilizing the core library pandas for data loading, analysis, and visualization, and looking into some more specialized libraries for advanced time-series tasks.

Choosing a time-series library

Python provides various libraries tailored for time-series analysis. The core library for time-series analysis in Python is pandas. Pandas provides efficient data structures and functions to handle time series effectively. It allows you to load data from diverse sources, such as CSV files and databases like Timescale.

With pandas, you can perform basic analysis and visualization of time-series data. The central data structure in pandas is the DataFrame, which serves as the primary unit for representing time-series data.

Using pandas, you can load time-series data from various sources with ease. Functions like read_csv() and read_sql() enable you to load data into a DataFrame for further analysis. This flexibility allows you to work with data from different formats and platforms.

Pandas provides a rich set of functionalities for analyzing and visualizing time-series data. You can perform various operations, including data aggregation, filtering, and computing summary statistics. Additionally, pandas integrates well with visualization libraries like Matplotlib and Seaborn, allowing you to create insightful plots and charts to explore patterns and trends in the data.

Here's an example that demonstrates the steps of loading and working with time-series data using pandas in Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Load time-series Data
dates = pd.date_range(start='2023-01-01', periods=100)
values = np.sin(np.linspace(0, 2*np.pi, 100))
data = pd.DataFrame({'Date': dates, 'Value': values})

# Step 2: Perform Data Analysis
# Calculate summary statistics
summary_stats = data.describe()

# Filter data based on specific conditions
filtered_data = data[data['Value'] > 0]

# Resample data to a different frequency
resampled_data = data.resample('1W', on='Date').sum()

# Step 3: Visualize time-series Data
plt.plot(data['Date'], data['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.title('Time Series Data')
plt.show()

This code generates a time-series dataset with dates and sine wave values. It performs data analysis tasks such as calculating summary statistics, filtering data based on conditions, and resampling the data to a different frequency. Finally, it visualizes the time-series data by plotting the values against the dates.

To delve deeper into pandas and its functionalities, you can refer to the official pandas documentation.

In addition to pandas, there are specialized libraries that can enhance your time-series analysis capabilities:

  1. sktime: sktime provides a unified interface for training and comparing multiple time-series models and connects to related libraries, enabling advanced modeling and analysis of time-series data.
  2. pmdarima: pmdarima is a library for fitting ARIMA (AutoRegressive Integrated Moving Average) models, including automatic order selection, a popular technique for time-series forecasting and analysis.
  3. tsfresh: tsfresh is a library specifically designed for feature extraction from time-series data. It provides a wide range of algorithms and techniques to extract meaningful features that can be used in machine learning and predictive modeling.

By leveraging these libraries, you can efficiently work with time-series data, perform advanced analysis, and extract valuable insights.

Obtain and store time-series data

Before diving into time-series analysis, it is essential to define the sources from which you'll gather your data. Consider the following factors:

  • Determine the purpose of your analysis and identify the specific data requirements.
  • Explore available data sources such as public repositories, databases, APIs, or data collected from your own applications.
  • Ensure data integrity and accuracy by choosing reliable and reputable sources.
  • Consider the frequency at which you'll update your data to ensure it remains up-to-date.

There are several possibilities for obtaining and storing time-series data, depending on your specific requirements. Here are a few common approaches:

Loading existing datasets: If you have time-series data stored in CSV or other flat file formats, you can use a library like pandas to load and manipulate the data, or import it into a database like Timescale. Pandas provides flexible functions to read data from files and perform various operations on it. You can find more information about loading data into Timescale in the Timescale documentation.

Here's an example of how to load time-series data from a CSV file using pandas:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('path/to/your/file.csv')

In the example above, you pass the file path of your CSV file to the read_csv() function. Pandas will parse the file and create a DataFrame containing the time-series data. For time-series work, you can also pass parse_dates=['your_date_column'] so the date column is parsed as timestamps on load.

Obtaining data from public or private APIs: Many organizations provide APIs that allow access to their time-series data. For example, weather data APIs provide historical and real-time weather information. You can use libraries like requests or specialized Python packages to interact with these APIs and retrieve the desired time-series data.
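
A minimal sketch with the requests library (the URL, parameters, and response shape are hypothetical):

import requests
import pandas as pd

# Hypothetical weather API endpoint and JSON layout
response = requests.get('https://api.example.com/v1/weather/history', params={'city': 'Vienna'})
response.raise_for_status()
records = response.json()  # assumed to be a list of {'date': ..., 'temperature': ...} objects
data = pd.DataFrame(records)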

Writing dynamically from your own apps: If you have your own applications generating time-series data, you can write code to capture and store it dynamically. For example, you can track user login activity or user purchase activity on a website and store it in a database or file.
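
For example, here's a sketch that records a login event in a PostgreSQL/Timescale table (the table, columns, and connection details are assumptions):

import psycopg2
from datetime import datetime, timezone

# Connection parameters and the 'logins' table are hypothetical
con = psycopg2.connect(host='your_host', database='your_database',
                       user='your_username', password='your_password')
with con, con.cursor() as cursor:
    cursor.execute(
        "INSERT INTO logins (user_id, logged_in_at) VALUES (%s, %s)",
        (42, datetime.now(timezone.utc)),
    )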

Load and analyze time-series data in Python

To load and analyze time-series data in Python, you can utilize various libraries and formats based on your specific requirements. One popular choice is using pandas, a powerful data manipulation library that provides a convenient way to load, transform, and analyze time-series data.

I've already created the table Weather in my database with two columns: date and temperature.

You can follow the steps below to load data from Timescale into pandas and perform time-series feature extraction using tsfresh.

  1. Install the required libraries, such as psycopg2, pandas, and tsfresh.
  2. Import the necessary modules as shown below:

import psycopg2
from psycopg2 import sql
import pandas as pd
from tsfresh import extract_features

3. Establish a connection to your Timescale instance:

con = psycopg2.connect(
    host='your_host',
    port='your_port',
    database='your_database',
    user='your_username',
    password='your_password'
)

4. Create a cursor object associated with your established connection to Timescale. It allows you to interact with the database and execute SQL queries.

cursor = con.cursor()

5. Create an SQL query object using the sql.SQL class from the psycopg2 module. This allows you to safely construct SQL queries with placeholders for parameter values.

LIMIT = 10
query = sql.SQL("SELECT * FROM Weather LIMIT {limit}").format(limit=sql.Literal(LIMIT))
cursor.execute(query)

The SELECT * FROM Weather query is a simple SQL query that retrieves all rows and columns from the "Weather" table. However, instead of fetching all records from the table, it limits the result set to a maximum of 10 records using the LIMIT clause.

After creating the query object, the line cursor.execute(query) runs the SQL query using the cursor object you created earlier. It sends the query to the Timescale database for execution.

6. Fetch all the rows returned by the query execution.

# Fetch the results
results = cursor.fetchall()

7. You can iterate over each row in the results list and print the row.

# Do something with the results
for row in results:
    print(row)

8. Extract data from the results list, which contains the rows fetched from the database, and store them in two separate lists, dates and values.

dates = []
values = []

for date, value in results:
    dates.append(date)
    values.append(value)

The for loop iterates over each row in the results list, where each row is represented as a tuple containing two elements: date and value. By using the syntax for date, value in results, you can directly unpack the tuple into two separate variables: date and value.

Inside the loop, the date and value variables are appended to the dates and values lists using the append() method. This stores the values separately for further processing or analysis.

9. Create a pandas DataFrame from the extracted data and perform feature extraction using the extract_features function from the tsfresh library.


# Create a pandas DataFrame from the extracted data
data_df = pd.DataFrame({'date': dates, 'temperature': values})
data_df['id'] = ""  # a single time series: give every row the same id for tsfresh

# Perform feature extraction
extracted_features = extract_features(data_df, column_id='id', column_sort='date')

10. Iterate over the columns of the extracted_features DataFrame and print information about each column and its value. For instance, you can examine the names and values of five of the extracted feature columns:

columns = extracted_features.columns

for column in columns[5:10]:
    print(f"Column: {column}")
    print(f"Data: {extracted_features[column][0]}")
    print()

This example demonstrates how to load time-series data from Timescale into pandas and extract features using the tsfresh library. Timescale provides efficient storage and retrieval of time-series data, optimized for time-based queries. It offers advantages such as:

  1. Hypertables: Timescale introduces the concept of hypertables, partitioned tables that automatically divide data into smaller chunks based on time intervals. This partitioning allows for parallelism and optimized query execution. In the above example, the "Weather" table could be a hypertable, enabling faster loading and retrieval of time-series data (a creation sketch follows this list).
  2. Time-Series Functions: Timescale provides a rich set of time-series functions and extensions for advanced analytics on time-based data, which complement Python-side tools like the tsfresh extract_features function used in the code above.
  3. Scalability: Timescale is designed to scale effortlessly as time-series data grows. It can handle massive amounts of data while maintaining high performance, so your time-series analysis remains efficient despite increasing data volumes.
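
If you create such a table yourself, here's a minimal sketch of turning it into a hypertable, assuming the TimescaleDB extension is enabled and reusing the con connection from earlier:

# Convert the regular 'weather' table into a hypertable partitioned by the 'date' column
with con.cursor() as cursor:
    cursor.execute("SELECT create_hypertable('weather', 'date');")
con.commit()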

Unleash the Power of Timescale for Time-Series Data

Timescale offers powerful time-series-specific SQL functions and extensions that enable efficient operations on data, such as querying, filtering, and aggregating time-series data. Timescale is optimized for handling large-scale time-series data, providing efficient storage, compression, and indexing techniques.

It also offers built-in functionalities that can replace or complement Python libraries for data cleaning and preprocessing. So, ready to supercharge your time-series data management?

With Timescale, you can execute Python code directly in the database, leveraging popular data packages like tsfresh for feature extraction and analysis. Create a free Timescale account today.

Further reading

If you're using Python and PostgreSQL for time-series analysis, what's the best adapter? Psycopg2 or psycopg3? We tested both to assess their performance. Also, check out the best tools to work with time series and Python.
