Data Engineer
Job Description
1. Design, build, and performance-tune Apache Spark workloads using Spark SQL and PySpark for complex transformations (JSON/semi-structured data, nested structures, window functions, joins, aggregations).
2. Profile and optimize Spark jobs: partitioning, shuffles, join strategies, skew, memory/spill, and right-sized resource usage—especially on EMR Serverless—for large-scale and petabyte-scale data.
3. Support customers and monitor pipelines around the clock, meeting strict SLAs for fixes and for reinstating failed runs.
4. Implement reusable patterns for incremental loads, deduplication and CDC-style processing.
5. Build and maintain ETL/ELT on AWS EMR Serverless (Spark), with S3 as the data lake layer: partitioning, compression, external tables, and layouts that support fast Spark and downstream SQL.
6. Design and tune Redshift workloads: sort keys, distribution styles, and SQL patterns that fit S3 → Spark → Redshift flows.
7. Optimize cost and performance across Spark jobs, S3 storage, and Redshift (including data retention and lifecycle policies where relevant).
8. Produce end-to-end designs: pipeline topology, data models, staging vs curated layers, incremental strategies, and clear tradeoffs (freshness, cost, complexity, reliability).
9. Apply access controls for sensitive financial and user data (least privilege, row/column-level patterns where required).
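For illustration, the deduplication/CDC-style responsibility above usually reduces to a keep-latest-record-per-key pattern. In Spark this is typically a `row_number()` window partitioned by the business key and ordered by a version column descending; here is a minimal plain-Python sketch of the same logic (field names `id` and `updated_at` are assumptions for the example):

```python
from itertools import groupby
from operator import itemgetter

def dedup_latest(records, key="id", version="updated_at"):
    # Keep only the most recent record per key -- the same result a Spark
    # row_number() window (partitionBy key, orderBy version desc, keep rn == 1)
    # produces on a DataFrame.
    ordered = sorted(records, key=itemgetter(key, version))
    return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter(key))]

rows = [
    {"id": 1, "updated_at": "2024-01-01", "status": "new"},
    {"id": 1, "updated_at": "2024-02-01", "status": "shipped"},
    {"id": 2, "updated_at": "2024-01-15", "status": "new"},
]
latest = dedup_latest(rows)  # one row per id, the newest version of each
```

The same shape handles CDC feeds: treat each change event as a versioned record and keep the latest per key before merging into the curated layer.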
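The skew-handling work in item 2 is often addressed by key salting: a hot join key is split across N buckets on the fact side, and the dimension side is replicated once per bucket so the join still matches. A minimal plain-Python sketch of the idea (the helper names and the `key` field are hypothetical, not from any library):

```python
import random

def salt_fact_keys(rows, hot_keys, n_salts=8, seed=0):
    # Rewrite each skewed key "K" as "K#i" with a random i in [0, n_salts),
    # so its rows spread across n_salts shuffle partitions instead of one.
    rng = random.Random(seed)
    out = []
    for r in rows:
        k = r["key"]
        if k in hot_keys:
            out.append({**r, "key": f"{k}#{rng.randrange(n_salts)}"})
        else:
            out.append(r)
    return out

def replicate_dim_keys(rows, hot_keys, n_salts=8):
    # Replicate each hot dimension row once per salt value so every
    # salted fact key "K#i" still finds its match after the join.
    out = []
    for r in rows:
        if r["key"] in hot_keys:
            out.extend({**r, "key": f"{r['key']}#{i}"} for i in range(n_salts))
        else:
            out.append(r)
    return out

facts = salt_fact_keys([{"key": "A", "amt": 10}, {"key": "B", "amt": 5}], {"A"}, n_salts=4)
dims = replicate_dim_keys([{"key": "A", "name": "hot"}, {"key": "B", "name": "cold"}], {"A"}, n_salts=4)
```

In Spark the same trick is expressed with column expressions (a `concat` of the key and a random salt, and an `explode` of a salt array on the dimension side); the sketch only shows the partitioning arithmetic.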