From the course: AWS: Storage and Data Management

AWS Glue overview

- [Instructor] AWS Glue is a fully managed, serverless data integration service based on PySpark, the Python API for Apache Spark. With Glue, you design data flows by connecting sources to targets with transformations in between. The Glue wizard and GUI help you define these jobs, which generate PySpark code. If you're already familiar with Python and Apache Spark, you'll be right at home. If not, Glue can get you started by proposing designs for some simple ETL jobs. Because Glue is serverless, you pay for the resources consumed by your running jobs, but you never have to create or manage an EC2 instance.

Another core feature of Glue is that it maintains a metadata repository of your various data schemas. These could be relational table schemas, the format of a delimited file, and more. Although it is sometimes confusing, Glue calls these metadata repositories databases. You can define a schema in one of two ways. First, you can enter it manually, typing the name of each data column or field and then specifying its type and data width. Alternatively, Glue can search your data sources and discover on its own what data schemas exist. To do this, you must define what's called a crawler. Crawlers can read from S3, RDS, or a JDBC source. They can discover table schemas, but they do not discover relationships. And they can be scheduled to run again over time, so the discovered schemas stay current.

With crawlers keeping your metadata up to date, mapping source data to destinations becomes fairly straightforward. Keep in mind that existing jobs are not automatically aware when schemas change and may need to be refreshed. Jobs can be triggered in three ways: on a schedule, such as daily or monthly; on completion of another job, which is how we chain dependent jobs; or on demand.

Finally, a few caveats with Glue. Unlike many popular ETL packages, it has no third-party connectors. You're not going to be connecting to Salesforce out of the box with Glue. As I mentioned earlier, when schemas change, you'll need to update the jobs that use them. Chaining dependent jobs is possible, but a job chain is not easy to visualize once it's built. And the wizard and GUI are really only suitable for very simple jobs. After that, you'll be writing Python. With that said, let's get into Glue and see what it can do.
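
To make the crawler and trigger ideas above a little more concrete, here is a minimal sketch of the request parameters you might pass to the boto3 Glue client. It is not part of the course material. The helper names (`build_crawler_request`, `build_trigger_request`) and every resource name, ARN, and S3 path are hypothetical placeholders. The helpers only build dictionaries, so nothing here touches AWS until you hand the result to a real client.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the boto3 Glue client's create_crawler call.

    All values are supplied by the caller; this sketch covers only an S3 target.
    """
    request = {
        "Name": name,
        "Role": role_arn,              # IAM role the crawler assumes
        "DatabaseName": database,      # Glue "database" that receives the schemas
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Cron expression in AWS format, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily
        request["Schedule"] = schedule
    return request


def build_trigger_request(name, job_name, schedule=None, after_job=None):
    """Build keyword arguments for create_trigger.

    schedule  -> run job_name on a schedule
    after_job -> run job_name after another job succeeds (job chaining)
    neither   -> on-demand trigger
    """
    if schedule is not None and after_job is not None:
        raise ValueError("choose either a schedule or an upstream job, not both")
    request = {"Name": name, "Actions": [{"JobName": job_name}]}
    if schedule is not None:
        request.update(Type="SCHEDULED", Schedule=schedule, StartOnCreation=True)
    elif after_job is not None:
        request.update(
            Type="CONDITIONAL",
            StartOnCreation=True,
            Predicate={
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "JobName": after_job,
                        "State": "SUCCEEDED",
                    }
                ]
            },
        )
    else:
        request["Type"] = "ON_DEMAND"
    return request


# Example use against a real account (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**build_crawler_request(
#       "sales-crawler", "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "sales_db", "s3://example-bucket/sales/"))
#   glue.start_crawler(Name="sales-crawler")
#   glue.create_trigger(**build_trigger_request(
#       "after-extract", "load-job", after_job="extract-job"))
```

Keeping the request-building separate from the AWS call is just a choice made for this sketch, so the parameters can be inspected without an account. In practice you might call the client directly.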
