Job Description
Role: AWS PySpark Data Engineer
Location: Reston, VA (Hybrid)
Type: Contract
Job Description:
We are seeking a highly skilled AWS PySpark Data Engineer to join our growing team. In this role, you will be responsible for designing, developing, and optimizing big data pipelines using AWS services and PySpark. You will work closely with data scientists, analysts, and other engineers to build scalable data architectures and drive business insights.
Key Responsibilities:
• Design and Develop Data Pipelines: Build scalable and efficient data pipelines using PySpark on AWS (Amazon EMR, AWS Glue, AWS Lambda).
• Data Transformation: Implement data transformations and cleansing using PySpark and AWS Glue.
• Cloud Integration: Leverage AWS services such as S3, Redshift, Athena, and Lambda to create data workflows.
• Data Modeling: Collaborate with the data architecture team to define data models for structured and unstructured data.
• Performance Tuning: Optimize PySpark code and AWS resource usage for high performance and cost-efficiency.
• Collaborate with Cross-Functional Teams: Work with data scientists, analysts, and other engineers to support data-driven projects.
• ETL Development: Create ETL (Extract, Transform, Load) processes using AWS Glue and PySpark (a minimal sketch follows this list).
• Data Quality Assurance: Ensure data accuracy, integrity, and reliability across pipelines.
• Monitoring & Logging: Set up monitoring, logging, and alerting for data pipeline health and performance.
• Documentation: Maintain clear and comprehensive documentation for data pipelines and architecture.
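For illustration only, the sketch below shows the general shape of a minimal AWS Glue PySpark job of the kind these responsibilities describe: it reads a table from the Glue Data Catalog, applies a simple type mapping, and writes Parquet to S3. The database, table, column, and bucket names are hypothetical placeholders, not details of this posting.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name passed in by the Glue runner.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical catalog database/table (e.g., registered by a Glue crawler).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="example_raw_db", table_name="orders"
)

# Simple cleansing/typing step; the column names are illustrative.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_total", "string", "order_total", "double"),
        ("order_ts", "string", "order_ts", "timestamp"),
    ],
)

# Write curated output as Parquet to a hypothetical S3 prefix.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)

job.commit()
```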
Required Skills & Qualifications:
• Experience with AWS: Hands-on experience with AWS services such as S3, EC2, Lambda, Glue, Redshift, and Athena.
• PySpark Expertise: Solid experience in PySpark for data transformation, processing, and optimization (see the sketch after this list).
• Big Data Technologies: Knowledge of big data frameworks and processing systems such as Apache Hadoop, Spark, and Kafka.
• ETL Development: Strong skills in designing and developing ETL pipelines using AWS Glue or other tools.
• Programming Skills: Proficiency in Python and SQL, plus familiarity with Java or Scala.
• Data Modeling and Warehousing: Experience designing data models and building data warehouses (e.g., Amazon Redshift).
• Version Control: Familiarity with Git for version control.
• Cloud Security & Best Practices: Knowledge of security best practices, data encryption, and IAM roles in AWS.
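As a rough illustration of the PySpark and SQL skills listed above (not a prescribed implementation), the sketch below runs a Spark SQL cleansing query over raw JSON in S3 and writes the result as date-partitioned Parquet, a layout that keeps Athena and Redshift Spectrum scans inexpensive. All bucket paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-curation").getOrCreate()

# Hypothetical raw landing zone in S3.
raw = spark.read.json("s3://example-raw-bucket/orders/")
raw.createOrReplaceTempView("raw_orders")

# SQL-based cleansing: cast the total, derive a partition date, drop null totals.
cleaned = spark.sql("""
    SELECT order_id,
           CAST(order_total AS DOUBLE) AS order_total,
           to_date(order_ts)           AS order_date
    FROM raw_orders
    WHERE order_total IS NOT NULL
""").dropDuplicates(["order_id"])

# Date-partitioned Parquet output; repartitioning first avoids many small files.
(cleaned
    .repartition("order_date")
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-curated-bucket/orders/"))
```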
Preferred Qualifications:
• Certification: AWS Certified Data Analytics - Specialty, or a similar certification.
• Experience with Kubernetes: Knowledge of deploying big data workloads using Kubernetes (EKS).
• Data Visualization: Experience with tools like Tableau, Power BI, or Amazon QuickSight.
• Knowledge of CI/CD: Familiarity with continuous integration and deployment (CI/CD) in the data pipeline lifecycle (a minimal unit-test sketch follows).
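To make the CI/CD bullet concrete, here is a minimal sketch of how a PySpark transformation might be unit-tested with pytest against a local Spark session so it can run in a CI pipeline. The transform and its column names are hypothetical examples, not part of this posting.

```python
import pytest
from pyspark.sql import SparkSession, functions as F


@pytest.fixture(scope="session")
def spark():
    # Small local session so the test can run on any CI runner.
    return (SparkSession.builder
            .master("local[2]")
            .appName("pipeline-unit-tests")
            .getOrCreate())


def add_order_year(df):
    # Hypothetical transform under test: derive the order year from the order date.
    return df.withColumn("order_year", F.year("order_date"))


def test_add_order_year(spark):
    df = (spark.createDataFrame([("o-1", "2024-03-15")], ["order_id", "order_date"])
               .withColumn("order_date", F.to_date("order_date")))
    result = add_order_year(df).collect()[0]
    assert result["order_year"] == 2024
```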