
I am new to AWS, and I have learned and developed code in Spark/Scala.

My application merges two sets of files in Spark and produces the final output.

I read both sets of files (MAIN files and INCR files) in Spark from an S3 bucket.

Everything works fine and I get the correct output, but I don't know how to automate the whole process to put it into production.

Here are the steps I follow to get the output.

STEP 1: Loading the MAIN files (5K text files). I read them from FTP onto an EC2 instance and then upload them to the S3 bucket.
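
Scripted, this transfer could look roughly like the sketch below. It assumes `lftp` is installed on the EC2 instance; the host, credentials, and bucket names are placeholders, not the real ones:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Mirror the MAIN files from the FTP server into a local staging directory.
# lftp copes well with directories holding thousands of files;
# FTP_USER and FTP_PASS are expected in the environment.
lftp -u "$FTP_USER","$FTP_PASS" ftp.example.com \
     -e "mirror /remote/main /tmp/main; quit"

# Upload the staged files to S3; sync only transfers new or changed files.
aws s3 sync /tmp/main s3://my-bucket/main/
```

The same script, with different paths, covers the INCR files in STEP 2, and it can run from cron on the EC2 instance.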

STEP 2: Loading the INCR (incremental) files the same way as the MAIN files.

STEP 3: Creating an EMR cluster manually from the console UI.
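
The CLI equivalent of this console step is scriptable. A minimal sketch; the release label, instance type, count, and S3 paths are placeholders:

```bash
# Launch a transient EMR cluster with Spark installed.
# --auto-terminate shuts the cluster down once all steps finish,
# so it only costs money while the job runs.
aws emr create-cluster \
  --name "weekly-merge" \
  --release-label emr-5.11.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/ \
  --auto-terminate
```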

STEP 4: Opening a Zeppelin notebook, pasting in the Spark/Scala script, and running it.
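
For automation, this copy-paste is the step to replace: package the Spark/Scala code as a jar, upload it to S3, and submit it as an EMR step instead of running it by hand in Zeppelin. A sketch, where the jar path `s3://my-bucket/jars/merge-app.jar`, the main class `com.example.MergeJob`, and the idea of passing the MAIN/INCR/output paths as program arguments are all assumptions about how the job could be packaged:

```bash
# Submit the merge job as a Spark step on a running cluster;
# j-XXXXXXXXXXXXX is the cluster id returned by create-cluster.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=MergeMainIncr,ActionOnFailure=CONTINUE,Args=[--class,com.example.MergeJob,s3://my-bucket/jars/merge-app.jar,s3://my-bucket/main/,s3://my-bucket/incr/,s3://my-bucket/output/]'
```

The same `--steps` value can also be passed straight to `create-cluster`, so a transient cluster runs the job and then terminates itself.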

STEP 5: Creating an EC2 instance again to read the S3 bucket and send the output files from S3 to the FTP server.
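
This is the STEP 1 script in reverse. Again a sketch with placeholder names:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Pull the job output from S3 onto the EC2 instance...
aws s3 sync s3://my-bucket/output/ /tmp/output

# ...and push it to the FTP server (mirror -R reverses direction: local -> remote).
lftp -u "$FTP_USER","$FTP_PASS" ftp.example.com \
     -e "mirror -R /tmp/output /remote/output; quit"
```

Both transfer scripts can live on the same EC2 instance, so a second instance is not strictly needed.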

I am using EC2 because I don't have a direct connection from S3 to the FTP server. We are in the process of getting AWS Direct Connect.

Please advise on how I can best automate this.

1 Answer


You have two options. If you also want to shut the cluster down and start it up again, then option 1, AWS Data Pipeline, is the better fit:

1. AWS Data Pipeline Developer Guide: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
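
Data Pipeline can run on a weekly schedule, launch the EMR cluster, run the Spark step, and shut the cluster down afterwards. Once the pipeline definition JSON is written (the Developer Guide above describes the format), registering and activating it from the CLI could look like this sketch; `pipeline-definition.json` and the names are placeholders:

```bash
# Register the pipeline and capture the id it is assigned.
PIPELINE_ID=$(aws datapipeline create-pipeline \
  --name weekly-merge --unique-id weekly-merge-token \
  --query 'pipelineId' --output text)

# Load the definition (schedule, EmrCluster, EmrActivity, ...) and activate it.
aws datapipeline put-pipeline-definition \
  --pipeline-id "$PIPELINE_ID" \
  --pipeline-definition file://pipeline-definition.json
aws datapipeline activate-pipeline --pipeline-id "$PIPELINE_ID"
```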

2. Oozie Workflow

You could use Oozie. Here is an example of how to automate Spark jobs on AWS with Oozie:

https://aws.amazon.com/de/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/

If you already have the FTP connection for the data transfer on your EC2 instance, then you can trigger a shell script from Oozie. Here is an example: running-shell-script-with-oozie
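
Whichever scheduler triggers it (an Oozie shell action, Data Pipeline, or plain cron), the shell script could chain all five steps end to end. A rough sketch, assuming the STEP 1 and STEP 5 transfers live in hypothetical helper scripts `ingest.sh` and `export.sh`, and reusing the placeholder jar from above:

```bash
#!/usr/bin/env bash
set -euo pipefail

./ingest.sh   # STEPS 1-2: FTP -> EC2 -> S3

# STEPS 3-4: transient cluster that runs the merge job and terminates itself.
CLUSTER_ID=$(aws emr create-cluster \
  --name "weekly-merge" \
  --release-label emr-5.11.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge --instance-count 3 \
  --use-default-roles --auto-terminate \
  --steps 'Type=Spark,Name=MergeMainIncr,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,com.example.MergeJob,s3://my-bucket/jars/merge-app.jar]' \
  --query 'ClusterId' --output text)

# Block until the cluster has finished (and therefore the step has run).
aws emr wait cluster-terminated --cluster-id "$CLUSTER_ID"

./export.sh   # STEP 5: S3 -> FTP
```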

Rene B.
  • But Oozie will not help me automate creating the cluster and loading the S3 files? It will only help me automate my job. – Sudarshan kumar Jan 16 '18 at 08:21
  • Ah, it was not clear that you also want to initialize the cluster; I thought it was running all the time. In that case, you could also use [AWS Data Pipeline](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html) to automate the movement and transformation of data. – Rene B. Jan 16 '18 at 09:04
  • So we can also create the cluster from AWS Data Pipeline? Also, my Spark job runs only once a week, and I upload files only once a week. – Sudarshan kumar Jan 16 '18 at 10:54
  • Yes, you can launch a cluster using the command line as described here: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-launch-emr-jobflow-cli.html – Rene B. Jan 16 '18 at 11:45
  • Just one more question: I am using Data Pipeline, but I am curious to know the difference between Data Pipeline and a CloudFormation template. – Sudarshan kumar Apr 05 '18 at 15:41