<< ---------------------------------------------------------------- >>
--- Last Modified: $= dv.current().file.mtime
Amazon Athena
Interactive query service that lets you analyze data directly in S3 using standard SQL. It is based on Presto.
It is serverless and integrates with the following services:
- CloudFormation
- CloudFront
- CloudTrail
- DataZone
- ELB
- EMR
- AWS Glue Data Catalog
- IAM
- QuickSight
- S3 Inventory
- Step Functions
- Systems Manager Inventory
- VPC
You can create tables either manually or through the AWS Glue wizard. Alternatively, an AWS Glue crawler can crawl the data and create them automatically.
Athena tables are Glue Data Catalog tables, so a table created in Athena exists in both services.
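Because the table definitions live in the Glue Data Catalog, submitting a query only needs a SQL string, a database name, and an S3 location for the results. Below is a minimal sketch built around boto3's `start_query_execution`. The database, table, and bucket names are hypothetical placeholders, and boto3 is imported lazily so that the request builder can be checked without AWS credentials.

```python
def build_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit the query and return its execution id.

    Requires boto3 and valid AWS credentials, so it is not called here.
    """
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_query_request(
    "SELECT * FROM access_logs LIMIT 10",
    database="example_db",
    output_location="s3://example-athena-results/queries/",
)
```

Queries run asynchronously: the returned execution id would then be polled (for example with `get_query_execution`) until the query finishes.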
SerDe
Serialization and deserialization libraries that Athena uses to parse data in different formats such as CSV, JSON, Parquet, and ORC.
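To make the role of a SerDe concrete, here is a rough sketch that builds a `CREATE EXTERNAL TABLE` statement for each of the formats above. The SerDe class names are ones commonly used with Athena (other choices exist, such as LazySimpleSerDe for CSV). The helper `create_table_ddl` and the table, column, and bucket names are hypothetical.

```python
# SerDe classes commonly used with Athena for row-oriented text formats.
ROW_FORMAT_SERDES = {
    "csv": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "json": "org.openx.data.jsonserde.JsonSerDe",
}
# Columnar formats can be declared with STORED AS, which implies the SerDe.
STORED_AS_FORMATS = {"parquet": "PARQUET", "orc": "ORC"}


def create_table_ddl(table, columns, data_format, location):
    """Return a CREATE EXTERNAL TABLE statement for the given data format."""
    fmt = data_format.lower()
    column_sql = ", ".join(f"`{name}` {col_type}" for name, col_type in columns)
    if fmt in ROW_FORMAT_SERDES:
        format_clause = f"ROW FORMAT SERDE '{ROW_FORMAT_SERDES[fmt]}'"
    elif fmt in STORED_AS_FORMATS:
        format_clause = f"STORED AS {STORED_AS_FORMATS[fmt]}"
    else:
        raise ValueError(f"unsupported format: {data_format}")
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column_sql}) "
        f"{format_clause} LOCATION '{location}'"
    )


# Hypothetical table and bucket, for illustration only.
columns = [("id", "bigint"), ("name", "string")]
json_ddl = create_table_ddl("example_db.events", columns, "json", "s3://example-bucket/events/")
parquet_ddl = create_table_ddl("example_db.events", columns, "parquet", "s3://example-bucket/events/")
```

The resulting string is what you would run in the Athena query editor; the data itself stays in S3 and only the table metadata is created.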
AWS Glue
Serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources.
Used for analytics, ML, and application development.
You can search and query the cataloged data using Athena, EMR, and Redshift Spectrum.
It can do
- data discovery
- modern ETL or ELT
- cleansing
- transforming
- centralized cataloging
AWS Glue Jobs
You can use the following job types:
- python shell engine
- ray job
- spark job
Jobs can be created in Visual ETL (AWS Glue Studio), Jupyter notebooks, or the script editor.
Jobs are charged based on the number of data processing units (DPUs) consumed over the run time, which is determined by a combination of worker type and number of workers.
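The billing rule above reduces to simple arithmetic: DPUs per worker times number of workers times run time times the price per DPU-hour. The sketch below assumes the standard Spark worker types (G.1X = 1 DPU, G.2X = 2 DPUs, and so on). The price of 0.44 per DPU-hour is a placeholder assumption, since the actual rate depends on region and job type, and the sketch ignores any minimum billing duration.

```python
# DPUs provided by each worker, for the standard Spark worker types.
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def job_dpu_hours(worker_type, num_workers, runtime_hours):
    """Total DPU-hours consumed by one job run."""
    return DPUS_PER_WORKER[worker_type] * num_workers * runtime_hours


def job_cost(worker_type, num_workers, runtime_hours, price_per_dpu_hour):
    """Estimated charge for one job run at the given DPU-hour price."""
    return job_dpu_hours(worker_type, num_workers, runtime_hours) * price_per_dpu_hour


# 10 workers of type G.2X = 20 DPUs; half an hour of run time = 10 DPU-hours.
dpu_hours = job_dpu_hours("G.2X", 10, 0.5)
# Placeholder price; check the current pricing page for your region.
cost = job_cost("G.2X", 10, 0.5, price_per_dpu_hour=0.44)
```

Doubling either the worker size or the worker count doubles the DPU-hours for the same run time, so the cheaper configuration is usually the one that finishes the job in the fewest total DPU-hours, not necessarily the one with the fewest workers.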
Glue Studio
Visually build ETL pipelines.
Glue Data Catalog
A fully managed Apache Hive Metastore-compatible catalog service that makes it easy for customers to store, annotate, and share metadata about their data.
Glue Data Crawler
Tool that is used to analyze a targeted data source to determine its schema and generate Glue Data Catalog tables.
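As a toy illustration of what "determine its schema" means, the sketch below samples a small CSV and picks a column type for each field. This is not Glue's actual classifier logic, only a simplified stand-in; the function names and the sample data are made up for the example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty value."""
    def all_parse(cast):
        for value in values:
            if value == "":
                continue  # treat empty cells as missing, not as evidence
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Return (column name, inferred type) pairs for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return [
        (name, infer_type([row[i] for row in data]))
        for i, name in enumerate(header)
    ]


# Made-up sample data, for illustration only.
sample = "id,price,city\n1,9.99,Lisbon\n2,12,Porto\n"
schema = infer_schema(sample)
```

A real crawler additionally detects the file format, handles partitions, and writes the resulting table definition into the Data Catalog.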
Data Lakes
Centralized data repository for unstructured and semi-structured data.
They generally use objects (blobs) or files as their storage medium.
They pull from various data sources and then transform, distribute or publish the data.
Lake Formation
Service for building data lakes that lets you centrally govern, secure, and globally share data for analytics and machine learning.
Same integrations as Glue.