Building a Secure Data Pipeline with Apache NiFi: Integrating Veracode API and Sending Encrypted CSV via SMTP
Introduction:
In today’s data-driven world, organizations rely on robust and secure data pipelines to process and analyze large volumes of data. Apache NiFi emerges as a powerful open-source tool that offers a flexible and scalable solution for constructing such pipelines. This blog post aims to guide you through the process of building a secure data pipeline using Apache NiFi, with a focus on integrating the Veracode API for security scanning and sending the resulting CSV file as an encrypted attachment via SMTP.
To keep sensitive information secure, we will walk through each step of the workflow involved in constructing this pipeline. By examining the configuration of each NiFi processor in turn, we will build a clear understanding of what each one does. Sample data and settings are provided throughout so that you can see how the pipeline would be implemented in a real-world scenario.
The core components of this secure data pipeline revolve around data extraction from an Oracle database, performing security scanning using Veracode, and ultimately ensuring secure email delivery of the processed data. By comprehensively covering each stage of the workflow, we aim to equip you with the knowledge and practical insights necessary to construct your own secure data pipelines using Apache NiFi.
Let’s embark on this journey to explore the intricate details of building a robust and secure data pipeline, leveraging the capabilities of Apache NiFi, Veracode API integration, and secure email delivery mechanisms.
Prerequisites:
To follow along with this tutorial, make sure you have the following:
Apache NiFi installed and configured.
Access to Veracode API and its documentation.
An SMTP server for sending email, together with PGP keys and encryption policies preconfigured for NiFi’s EncryptContent processor.
Assumptions made in this workflow:
Apache NiFi Installation: It is assumed that Apache NiFi is installed and properly configured in your environment. The configurations and settings discussed in this blog post are based on the assumption that you have a working Apache NiFi instance.
Oracle Database Connection: The blog assumes that you have a running Oracle database with the necessary access credentials (IP address, username, and password) to connect and query the “Youtube Optins” table. Adjust the connection details according to your Oracle database setup.
Veracode API Configuration: The Veracode API endpoint and authentication details are assumed to be provided by Veracode. Make sure to obtain the correct API endpoint and configure the necessary headers and request body based on Veracode’s API documentation and requirements.
Split Configuration: The split step is configured to break the data into chunks of 10,000 records (see the SplitRecord/SplitJson discussion below). Adjust the split count as per your requirements or based on the limitations of downstream processors.
SMTP Server Configuration: The SMTP server used for sending emails is assumed to be an existing and preconfigured server. The SMTP hostname, port, username, and password provided in the sample configuration are placeholders. Replace them with the actual SMTP server details provided by your email service provider or IT team.
PGP Encryption: The EncryptContent processor is assumed to utilize a preconfigured PGP encryption policy for encrypting the CSV file. Ensure that you have the necessary encryption keys and policies in place to successfully encrypt the content.
Sample Data: The blog assumes that the “Youtube Optins” table in the Oracle database contains fields such as artist_name, song_name, revenue, bank_account, email, and mobile_number. Adjust the SQL query and field names according to your actual database schema.
Please note that the provided configurations and assumptions are for illustrative purposes only. It is crucial to adapt the configurations and settings to your specific environment, security policies, and requirements.
Workflow Overview:
Our data pipeline workflow consists of the following steps:
GenerateFlowFile to start the pipeline.
ExecuteSQL (or ExecuteSQLRecord) to query the Oracle database table.
SplitRecord (or SplitJson) to split the data into smaller chunks.
ConvertRecord to convert the JSON data to CSV format.
InvokeHTTP to invoke the Veracode API for security scanning.
EvaluateJsonPath to handle the Veracode scan results.
UpdateAttribute to append the date to the CSV filename.
EncryptContent to encrypt the CSV file.
PutEmail to send the encrypted CSV file via SMTP.
NiFi Processors and Configurations in Detail
1. GenerateFlowFile: Starting the Pipeline
To start the data pipeline, we’ll use the GenerateFlowFile processor to generate a dummy flow file. The processor needs minimal configuration; it simply emits a flow file to initiate the pipeline. Here we also stamp the filename with today’s date using the NiFi Expression Language (see the sketch after the configuration below).
Processor: GenerateFlowFile
Configurations:
Filename: Daily_Youtube_Optins_${now():format('yyyyMMdd')}.csv
Scheduling: Timer driven, running continuously (configured on the processor’s Scheduling tab)
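If you are unsure what that Expression Language evaluates to, here is a minimal standalone Python sketch of the equivalent logic; the date shown in the comment is illustrative:

from datetime import date

# Equivalent of Daily_Youtube_Optins_${now():format('yyyyMMdd')}.csv
filename = f"Daily_Youtube_Optins_{date.today():%Y%m%d}.csv"
print(filename)  # e.g. Daily_Youtube_Optins_20230615.csv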
2. ExecuteSQL: Querying the Oracle Database
To query the Oracle database table, we’ll use the ExecuteSQL processor. Configure the processor as follows:
Database Connection Pooling Service: Specify the Oracle database connection details.
SQL Query: Enter the SQL query to retrieve the required data.
The ExecuteSQL processor queries the Oracle database table and retrieves the required data. Remember, we need to configure the processor with the appropriate database connection details and the SQL query. Note that ExecuteSQL emits its results in Avro format by default; if downstream processors expect JSON, use ExecuteSQLRecord with a JSON record writer instead.
Processor: ExecuteSQL
Configurations:
Database Connection URL (set on the DBCPConnectionPool controller service): jdbc:oracle:thin:@163.234.01.10:1521:orcl
Username: Google user (placeholder)
Password: @Google2023 (placeholder; store real credentials as sensitive properties, never in plain text)
SQL Query: SELECT artist_name, song_name, revenue, bank_account, email, mobile_number FROM "Youtube Optins"
Let’s assume we have a table called transactions with columns such as transaction_id, customer_name, amount, and transaction_date. You can use the following SQL query to retrieve the data:
SELECT transaction_id, customer_name, amount, transaction_date
FROM transactions
WHERE transaction_date = CURRENT_DATE
Make sure to configure the ExecuteSQL processor with the appropriate database connection details and the provided SQL query.
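To make the querying step concrete, here is a hedged standalone sketch of what ExecuteSQL does under the hood, using the python-oracledb driver; the credentials, DSN, and table name are placeholders to adapt to your environment:

import oracledb

# Placeholder connection details; adjust for your host, port, and
# SID/service name, and never hard-code real credentials.
conn = oracledb.connect(
    user="google_user",
    password="change_me",
    dsn="163.234.01.10:1521/orcl",
)
with conn.cursor() as cur:
    cur.execute(
        'SELECT artist_name, song_name, revenue, bank_account, '
        'email, mobile_number FROM "Youtube Optins"'
    )
    rows = cur.fetchall()
print(f"Fetched {len(rows)} rows")
conn.close()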
3. SplitRecord/SplitJson: Splitting the Data into Smaller Chunks
Once we have the data from the database, we’ll split it into smaller chunks. This step is necessary if the resulting data is too large to handle as a single flow file. Note that SplitJson splits a JSON array on a JsonPath expression, emitting one flow file per matched element; to break the data into fixed-size chunks of 10,000 records, use the record-oriented SplitRecord processor, whose Records Per Split property does exactly that. Adjust the split count as per your requirements or based on the limitations of downstream processors. A standalone sketch of the chunking logic follows the configuration below.
Processor: SplitRecord (or SplitJson for per-element splits)
Configurations:
Records Per Split: 10000
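Here is a minimal Python sketch of the chunking behaviour this step performs; in NiFi each chunk would become its own flow file:

from typing import Any, Iterator

def split_records(records: list[Any], size: int = 10_000) -> Iterator[list[Any]]:
    """Yield successive chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Example: 25,000 records become chunks of 10,000 / 10,000 / 5,000.
chunks = list(split_records(list(range(25_000))))
print([len(c) for c in chunks])  # [10000, 10000, 5000]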
4. ConvertRecord: Converting JSON to CSV
After splitting the JSON data, we’ll use the ConvertRecord processor to convert it to CSV format. ConvertRecord delegates parsing and writing to controller services, so the conversion rules, schema strategy, and CSV formatting live on the reader and writer services.
Processor: ConvertRecord
Configurations:
Record Reader: A JsonTreeReader controller service
Record Writer: A CSVRecordSetWriter controller service
Schema Access Strategy (on the reader): Infer Schema
CSV Format Settings (on the writer): Delimiter, Quote Character, etc.
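The following standalone Python sketch shows the same JSON-to-CSV conversion using the csv module; the single record is illustrative sample data matching the schema used above:

import csv
import io

# One illustrative record shaped like the "Youtube Optins" schema.
records = [
    {"artist_name": "Asha", "song_name": "Sunrise", "revenue": 1200.50,
     "bank_account": "0011223344", "email": "asha@example.com",
     "mobile_number": "+254700000000"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=records[0].keys())
writer.writeheader()   # header row, as a CSVRecordSetWriter would emit
writer.writerows(records)
print(buffer.getvalue())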
5. Invoke Veracode API for Security Scan
Next, we’ll use the InvokeHTTP processor to invoke the Veracode API for security scanning. Refer to the Veracode API documentation for the specific endpoint and authentication requirements. Configure the InvokeHTTP processor with the following settings:
HTTP Method: Choose the appropriate method (POST or PUT) based on the Veracode API requirements.
Remote URL: Specify the Veracode API endpoint for security scanning.
Headers: Add any necessary headers for authentication and API versioning.
Content-Type: Set it to the appropriate value based on the data being sent.
For POST and PUT requests, InvokeHTTP sends the content of the incoming flow file as the request body, so the CSV data is delivered as the payload automatically.
Processor: InvokeHTTP
Configurations:
HTTP Method: POST or PUT (depending on Veracode API requirements)
Remote URL: The Veracode API endpoint for security scanning
Headers: Add the necessary headers (as dynamic properties) for authentication and API versioning
Content-Type: Set it to the appropriate value for the data being sent
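As a hedged illustration, here is a standalone Python sketch of the HTTP call this step makes; the URL and Authorization header are placeholders, since Veracode’s real endpoints and HMAC-based authentication scheme are defined in its API documentation:

import requests

VERACODE_URL = "https://api.veracode.com/placeholder/scan"  # hypothetical endpoint

with open("Daily_Youtube_Optins.csv", "rb") as f:
    response = requests.post(
        VERACODE_URL,
        headers={
            "Authorization": "<Veracode HMAC signature>",  # placeholder
            "Content-Type": "text/csv",
        },
        data=f.read(),  # like InvokeHTTP, send the file content as the body
        timeout=60,
    )
response.raise_for_status()
print(response.json())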
6. Handle Veracode Scan Results
To process the Veracode API response, we’ll use the EvaluateJsonPath processor. Configure the processor as follows:
Destination: Set it to “flowfile-attribute” or “flowfile-content” based on where you want to store the extracted information.
JSONPath expressions: Specify the JSONPath expressions to extract the required information from the Veracode API response.
After invoking the Veracode API, handle the scan results using the EvaluateJsonPath processor (or other relevant processors, depending on the response format). For example, if the API response is in JSON format and contains the scan status, you can extract it with the following JSONPath expression, added as a dynamic property on the processor:
$.scan_status
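The sketch below shows the equivalent extraction in standalone Python; the response body is an illustrative shape, not Veracode’s actual schema:

import json

response_body = '{"scan_status": "PASSED", "findings": []}'  # sample payload
scan_status = json.loads(response_body)["scan_status"]       # equivalent of $.scan_status

if scan_status == "PASSED":
    print("Scan passed; continue the pipeline")   # e.g. route onward to rename/encrypt
else:
    print("Scan failed; route to error handling")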
7. UpdateAttribute: Appending the Date to the CSV Filename
To append the date to the CSV filename, we’ll use the UpdateAttribute processor. In NiFi the filename is a flow file attribute rather than a field inside the records, so UpdateAttribute, not a record-oriented processor, is the idiomatic way to rename the file. This step ensures that each CSV file generated has a unique name that includes the date.
Processor: UpdateAttribute
Configurations:
Dynamic Property: filename
Value: Daily_Youtube_Optins_${now():format('yyyyMMdd')}.csv
8. EncryptContent: Encrypting the CSV File
Once we have the resulting data, we’ll encrypt the CSV file using the EncryptContent processor. EncryptContent operates directly on the flow file content, so no input or output file paths are required. Configure the processor to use your preconfigured PGP keys and encryption policies; this step ensures that the CSV file remains secure during transmission.
Processor: EncryptContent
Configurations:
Mode: Encrypt
Encryption Algorithm: A PGP algorithm (for example, PGP or PGP_ASCII_ARMOR)
Public Keyring File: The path to the keyring containing the recipient’s public key
Public Key User Id: The user ID of the PGP key to encrypt with
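For intuition, here is a hedged standalone sketch of the encryption step using the python-gnupg package; it assumes the recipient’s public key has already been imported into the local GnuPG keyring, and the filenames and key user ID are placeholders:

import gnupg

gpg = gnupg.GPG()  # uses the default GnuPG home directory and keyring

with open("Daily_Youtube_Optins_20230615.csv", "rb") as f:
    result = gpg.encrypt_file(
        f,
        recipients=["recipient@example.com"],   # placeholder key user ID
        output="Daily_Youtube_Optins_20230615.csv.pgp",
        armor=True,                             # ASCII-armored output, like PGP_ASCII_ARMOR
    )

print("Encrypted OK" if result.ok else f"Encryption failed: {result.status}")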
9. PutEmail: Sending the Encrypted CSV File via SMTP
Finally, we’ll use the PutEmail processor, NiFi’s SMTP processor, to send the encrypted CSV file as an attachment. Configure the processor with the SMTP server details, such as the hostname, port, and authentication credentials, and specify the sender, recipients, subject, and message body for the email. With Attach File enabled, the content of each incoming flow file, here the encrypted CSV, is sent as an attachment; if the data was split earlier in the pipeline, each split file is delivered in its own email.
To send the encrypted CSV file via SMTP, configure the processor as follows:
Processor: PutEmail
Configurations:
SMTP Hostname: smtp.gmail.com (replace with your provider’s SMTP host)
SMTP Port: 587
SMTP Username: Your SMTP username
SMTP Password: Your SMTP password
From: The sender’s email address
To: The recipient’s email address(es)
Subject: The subject line for the email
Message: A customized message body for the email
Attach File: true (attach the encrypted CSV)
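The following standalone Python sketch mirrors what PutEmail does, sending the encrypted file over SMTP with STARTTLS; the hostnames, addresses, filenames, and credentials are placeholders:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Daily Youtube Optins (encrypted)"
msg.set_content("Please find the encrypted daily report attached.")

# Attach the PGP-encrypted CSV produced by the previous step.
with open("Daily_Youtube_Optins_20230615.csv.pgp", "rb") as f:
    msg.add_attachment(
        f.read(),
        maintype="application",
        subtype="octet-stream",
        filename="Daily_Youtube_Optins_20230615.csv.pgp",
    )

with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls()                            # upgrade to TLS, matching port 587
    server.login("smtp_user", "smtp_password")   # placeholder credentials
    server.send_message(msg)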
Conclusion:
In this blog post, we have taken a deep dive into building a secure data pipeline using Apache NiFi. We integrated the Veracode API for security scanning and sent the resulting CSV file as an encrypted attachment via SMTP, explaining each processor’s configuration along the way and highlighting the significance of integrating Veracode for security purposes. By leveraging the power of Apache NiFi, you can confidently extract data from Oracle databases, perform security scans using Veracode, and securely send processed data via email.
Remember to adapt the configurations to your specific environment and requirements, and consult the official Apache NiFi documentation for more in-depth information on each processor’s configurations and usage. With Apache NiFi, you can build robust and secure data pipelines that meet the demands of modern data engineering.
Happy pipeline building!
About the Author:
Emmanuel Odenyire Anyira is a Senior Data Analytics Engineer at Safaricom PLC. With extensive experience in designing and building data collection systems, processing pipelines, and reporting tools, Emmanuel has established himself as a thought leader in the field of data analytics and infrastructure management. He possesses expertise in various technologies, including Apache NiFi, Informatica PowerCenter, Tableau, and multiple programming languages. Emmanuel’s passion for automation and optimizing workflows has driven him to share his insights and expertise through writing and speaking engagements.