Apache Airflow Administration
Company: Centraprise
Location: Palo Alto, CA 94303
Description:
Apache Airflow Administration (optimization, configuration, backup/disaster recovery, monitoring)
Locations: Charlotte, NC / Chandler, AZ / Dallas, TX / Minneapolis, MN / NY Metro & Palo Alto, CA
Duration: 6+ months (extension probable)
Qualification notes:
Overview: This role works with the Data Platform and Data Delivery teams. They have built a production framework spanning 9 data domains that serves as a single source for data insights/analytics across those domains, and they have developed a self-service data platform where consumers can access the data.
The framework is built with PySpark, Hadoop (Hive, HDFS, YARN), and the Airflow scheduler.
Three data source types: APIs, database ingestion (SQL, Postgres, Oracle), and flat-file ingestion; they will eventually migrate to streaming. Downstream, data is extracted and sent as flat files, published directly to Tableau Server, or ingested into a SQL Server database.
As they onboard more consumers and scale the platform, Airflow is under significant strain, and they do not have an SME to handle optimization, backup and disaster recovery, day-to-day monitoring, and performance tuning.
This person will work with two engineering teams, Data Management and Data Delivery, which build the Airflow DAGs, develop the Hadoop/PySpark data pipelines, and provide production support.
They have installed Airflow with only very basic configuration. They are moving from the old cluster to a new one, so they want this person to understand multi-tenant loads and how to configure, customize, and optimize Airflow.
This person will not be responsible for building any DAGs, but for optimizing and monitoring what is already in production. Prometheus is used for monitoring and alerts.
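For context on the kind of tuning this role covers (a sketch only, not the client's actual setup): multi-tenant optimization usually combines environment-level settings (overall parallelism, worker concurrency) with Airflow pools and per-DAG caps so one tenant's load cannot starve the others, and Airflow's StatsD metrics are commonly bridged into Prometheus for the monitoring side. The Python sketch below (Airflow 2.2+ assumed; the DAG id, pool name, and values are hypothetical) shows the per-DAG knobs involved:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The "tenant_a" pool is created separately by the admin, e.g.:
    #   airflow pools set tenant_a 16 "Slots reserved for tenant A"
    with DAG(
        dag_id="example_tenant_a_load",   # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
        max_active_runs=1,                # at most one run of this DAG at a time
        max_active_tasks=8,               # cap concurrent task instances for this DAG
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command="echo extract",  # placeholder command
            pool="tenant_a",              # charge this task against the tenant's pool slots
            priority_weight=5,            # favored when pool slots are scarce
        )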
What are best practices, and what can they bring to the table from previous roles?
The immediate need is to optimize for concurrent loads. Once the Airflow environment is stable, they want to move the on-prem platform to the cloud (GCP), and this person will be responsible for some of the PySpark work.
Ideal candidate: Airflow SME, GCP, 2 years' experience with the Hadoop tech stack (Hive, HDFS, YARN; from an environment standpoint, not expected to administer or develop the Hadoop platform), and 2 years' experience developing PySpark code. A candidate who is a standout in Airflow but lacking everything else can still be considered; the client is willing to flex on the amount of Hadoop and PySpark experience.
Backup, failover, and disaster recovery are already in place, but they are looking for this person to optimize those strategies and make recommendations.
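One typical piece of such a strategy, sketched below under stated assumptions only (Postgres metadata database, pg_dump available, hostnames and paths hypothetical): regular dumps of the Airflow metadata database, with DAG code and configuration kept in version control so the environment can be rebuilt.

    #!/usr/bin/env python3
    """Sketch of a scheduled Airflow metadata-database backup (assumes Postgres).

    Host, database name, and backup path are placeholders, not the client's values.
    Credentials are expected to come from a .pgpass file or the environment.
    """
    import datetime
    import subprocess

    BACKUP_DIR = "/backups/airflow"      # hypothetical backup location
    DB_HOST = "airflow-metadata-db"      # hypothetical Postgres host
    DB_NAME = "airflow"                  # default Airflow metadata DB name

    def dump_metadata_db() -> str:
        """Run pg_dump in custom format so the dump restores with pg_restore."""
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        outfile = f"{BACKUP_DIR}/airflow-metadata-{stamp}.dump"
        subprocess.run(
            ["pg_dump", "-h", DB_HOST, "-d", DB_NAME, "-Fc", "-f", outfile],
            check=True,  # fail loudly so the backup job alerts instead of silently skipping
        )
        return outfile

    if __name__ == "__main__":
        print(f"Wrote {dump_metadata_db()}")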
**Mentioned that candidates from a startup-type environment might be good to target. Vetting questions: How big an Airflow environment have you worked on? What monitoring or optimization work did you do? What root causes did you find, and how did you resolve them?
Must Have:
Apache Airflow Administration (optimization, configuration, backup/disaster recovery, monitoring)
PySpark
Hadoop (worked in the environment; does not need to have administered or engineered it)
**CAN consider just a very strong Airflow Admin
Nice to Have: Google Cloud