Data engineering scenarios

Python ETL to pull data from MySQL to Oracle. Use cron to schedule the data pipeline

To pull data from MySQL into Oracle with a Python ETL script, follow these steps:

  1. Install the required python libraries — pymysql, cx_Oracle, and pandas. You can install them using pip.

  2. Connect to MySQL and Oracle databases using the appropriate credentials.

  3. Retrieve the data from MySQL database table using a SELECT query and store it in a pandas dataframe.

  4. Transform the data if required (e.g., clean up the data, add new columns, etc.)

  5. Connect to the Oracle database and, if the target table does not already exist, create it with the appropriate schema to store the data. Skip this step if the table already exists.

  6. Insert the data from the pandas dataframe into the Oracle database table.

  7. Schedule the ETL pipeline using cron.

Here’s a detailed step-by-step guide on how to create a Python ETL pipeline to extract data from MySQL and load it into Oracle, and schedule it using cron:

Step 1: Install the required libraries. To connect to MySQL and Oracle databases, and to work with data in Python, we need to install the following libraries:

  • pymysql — to connect to MySQL

  • cx_Oracle — to connect to Oracle

  • pandas — to work with data in Python

You can install these libraries using pip:

pip install pymysql cx_Oracle pandas

Step 2: Connect to MySQL and Oracle databases. To connect to MySQL and Oracle, we need to provide the following details:

MySQL hostname, username, password, and database name

Oracle hostname, port, SID, username, and password

Here is an example of how to connect to MySQL and Oracle:

import pymysql
import cx_Oracle

# Connect to MySQL
mysql_conn = pymysql.connect(
    host='mysql_host',
    user='mysql_user',
    password='mysql_password',
    db='mysql_database'
)
# Connect to Oracle
oracle_conn = cx_Oracle.connect(
    'oracle_user/oracle_password@oracle_host:port/oracle_sid'
)

Replace the placeholders with the actual values for your MySQL and Oracle databases.
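Hardcoding credentials works for a demo, but for a scheduled job it is safer to read them from the environment. Here is a minimal sketch; the `load_db_config` helper and the `MYSQL_*` variable names are illustrative, not part of pymysql or cx_Oracle:

```python
import os

def load_db_config(prefix, keys):
    """Read connection settings such as MYSQL_HOST, MYSQL_USER, ...
    from environment variables. Raises KeyError if one is missing."""
    return {key: os.environ[f"{prefix}_{key.upper()}"] for key in keys}

# Usage:
# mysql_cfg = load_db_config("MYSQL", ["host", "user", "password", "db"])
# mysql_conn = pymysql.connect(**mysql_cfg)
```

This keeps secrets out of the script itself and out of version control; cron jobs can source the variables from a protected file before running the script.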

Step 3: Retrieve data from MySQL. To retrieve data from MySQL, we need to execute a SELECT query and store the result in a pandas DataFrame.

import pandas as pd

# Execute SELECT query and store result in a pandas DataFrame
query = "SELECT * FROM mysql_table"
df = pd.read_sql(query, mysql_conn)

Replace mysql_table with the name of the table from which you want to extract data.
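Pulling the whole table on every run can get expensive as the table grows. If the source table has a timestamp column, you can extract incrementally instead. A sketch, assuming a hypothetical updated_at column; `build_extract_query` is an illustrative helper, not a library function:

```python
def build_extract_query(table, since=None, ts_column="updated_at"):
    """Build a SELECT for full or incremental extraction.
    Returns the SQL string and the parameter list for pymysql's
    %s-style placeholders."""
    query = f"SELECT * FROM {table}"
    params = []
    if since is not None:
        query += f" WHERE {ts_column} >= %s"
        params.append(since)
    return query, params

# Usage with pandas:
# query, params = build_extract_query("mysql_table", since="2024-01-01")
# df = pd.read_sql(query, mysql_conn, params=params)
```

Using bind parameters rather than string interpolation for the `since` value also avoids SQL-injection issues.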

Step 4: Transform the data if required. We can perform data cleaning, data aggregation, or any other data transformation tasks on the data extracted from MySQL.

Here is an example of how to clean up the data by removing rows with missing values:

# Remove rows with missing values
df = df.dropna()

Replace this step with any data transformation tasks that are required for your use case.
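The same kind of cleanup can also be expressed without pandas if you ever work with plain Python rows; here is a sketch of the "drop rows with missing values" step on a list of dicts (`clean_rows` is an illustrative helper, not a library function):

```python
def clean_rows(rows):
    """Drop rows containing None and strip whitespace from strings,
    mirroring df.dropna() plus a simple normalization step."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip rows with missing values
        cleaned.append({key: value.strip() if isinstance(value, str) else value
                        for key, value in row.items()})
    return cleaned
```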

Step 5: Create the table in Oracle. Before we can insert data into Oracle, we need to create a table with the appropriate schema to store the data.

# Create table in Oracle
cursor = oracle_conn.cursor()
cursor.execute("CREATE TABLE oracle_table (col1 datatype1, col2 datatype2, ...)")

Replace oracle_table with the name of the table that you want to create in Oracle, and col1 datatype1, col2 datatype2, … with the schema of the table.
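One way to keep the DDL readable is to generate it from a column specification. A minimal sketch (`create_table_ddl` is an illustrative helper); note that in Oracle, running CREATE TABLE when the table already exists raises ORA-00955, so in practice you would check for the table first or catch that error:

```python
def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement from (name, datatype) pairs."""
    col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE {table} ({col_defs})"

# Usage:
# ddl = create_table_ddl("oracle_table",
#                        [("col1", "NUMBER"), ("col2", "VARCHAR2(100)")])
# cursor.execute(ddl)
```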

Step 6: Insert the data into Oracle. After creating the table in Oracle, we can insert the data from the pandas DataFrame into the table.

# Insert data into Oracle table
for index, row in df.iterrows():
    cursor.execute("INSERT INTO oracle_table (col1, col2, ...) VALUES (:1, :2, ...)", tuple(row))

Replace oracle_table with the name of the table that you created in Oracle, and col1, col2, … with the names of the columns in the table. After the inserts, call oracle_conn.commit() so the changes are persisted.
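Row-by-row execute() works but is slow for large tables; cx_Oracle's executemany() sends the rows in batches. A sketch that builds the INSERT statement from the DataFrame's columns (`build_insert` is an illustrative helper):

```python
def build_insert(table, columns):
    """Build an INSERT using cx_Oracle positional binds (:1, :2, ...)."""
    binds = ", ".join(f":{i}" for i in range(1, len(columns) + 1))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({binds})"

# Usage with a DataFrame:
# sql = build_insert("oracle_table", list(df.columns))
# cursor.executemany(sql, [tuple(row) for row in df.itertuples(index=False)])
# oracle_conn.commit()
```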

Step 7: Schedule the ETL pipeline using cron. To schedule the ETL pipeline using cron, we need to create a shell script that runs the Python script, and add an entry to the crontab file on the server.

Here is an example of a shell script that runs the Python script:

#!/bin/bash

# Activate virtual environment if necessary
source /path/to/venv/bin/activate

# Run Python script
python /path/to/etl_script.py

Here is the complete Python ETL pipeline code to extract data from MySQL and load it into Oracle:

import pymysql
import cx_Oracle
import pandas as pd
# Connect to MySQL
mysql_conn = pymysql.connect(host='mysql_host',
                             user='mysql_user',
                             password='mysql_password',
                             db='mysql_database')
# Connect to Oracle
oracle_conn = cx_Oracle.connect('oracle_user/oracle_password@oracle_host:port/oracle_sid')
# Execute SELECT query and store result in a pandas dataframe
query = "SELECT * FROM mysql_table"
df = pd.read_sql(query, mysql_conn)
# Remove rows with missing values
df = df.dropna()
# Create table in Oracle
cursor = oracle_conn.cursor()
cursor.execute("CREATE TABLE oracle_table (col1 datatype1, col2 datatype2, ...)")
# Insert data into Oracle table
for index, row in df.iterrows():
    cursor.execute("INSERT INTO oracle_table (col1, col2, ...) VALUES (:1, :2, ...)", tuple(row))
# Commit changes to Oracle database
oracle_conn.commit()
# Close connections
mysql_conn.close()
oracle_conn.close()
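Since the script will run unattended under cron, it helps to add logging and to structure the stages so failures are visible. A minimal orchestration sketch (`run_etl` and the stage callables are illustrative; each stage would wrap the corresponding code above):

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_etl(extract, transform, load):
    """Run the extract/transform/load stages in order, logging row
    counts; exceptions propagate so cron can report a failed run."""
    rows = extract()
    log.info("extracted %d rows", len(rows))
    rows = transform(rows)
    log.info("%d rows after transform", len(rows))
    load(rows)
    log.info("load complete")
    return len(rows)
```

Letting exceptions propagate (rather than swallowing them) means a non-zero exit status, which cron can surface via mail or a wrapper script.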

To schedule this Python script using cron, follow these steps:

  1. Open the crontab file by running the command crontab -e in the terminal.

  2. Add a new line to the file with the following syntax:

* * * * * /path/to/python /path/to/etl_script.py
  3. Replace /path/to/python with the path to your Python interpreter, and /path/to/etl_script.py with the path to your ETL pipeline script.

  4. The five asterisks (* * * * *) represent the schedule time. Here's a quick guide to help you set the schedule time:

  • * (asterisk) means every unit of time. For example, * * * * * means every minute.

  • The first field is the minute (0 to 59).

  • The second field is the hour (0 to 23).

  • The third field is the day of the month (1 to 31).

  • The fourth field is the month (1 to 12).

  • The fifth field is the day of the week (0 to 6, where 0 is Sunday).

  5. For example, to run the ETL pipeline every day at 1:00 AM, you can set the schedule time to 0 1 * * *.

  6. Save and exit the crontab file.
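If you generate crontab entries from code, or just want to sanity-check the field order, a tiny helper can format them (`cron_entry` is an illustrative helper; the field order — minute, hour, day of month, month, day of week — is standard cron):

```python
def cron_entry(minute, hour, day="*", month="*", weekday="*", command=""):
    """Format a crontab line:
    minute hour day-of-month month day-of-week command"""
    return f"{minute} {hour} {day} {month} {weekday} {command}".rstrip()

# Usage:
# cron_entry(0, 1, command="/path/to/python /path/to/etl_script.py")
```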

You can work from this code snippet and adapt it to implement an ETL for your own use case. The ETL will run automatically according to the schedule you set in the cron job.

To schedule the ETL pipeline using cron, you can create a shell script that runs the Python script and add an entry in the crontab file. For example, if you want to run the Python script every day at 2:30 am, you can create a shell script like this:

#!/bin/bash
python /path/to/etl_script.py

And add the following entry to the crontab file:

30 2 * * * /path/to/shell_script.sh

This will run the shell script every day at 2:30 am, which in turn runs the Python script to pull data from MySQL to Oracle.