Hadoop Developer
Phanish

Professional Summary

  • CCAH (Cloudera Certified Administrator for Apache Hadoop) certified professional.
  • AWS (Amazon Web Services) Certified Solutions Architect – Associate
  • 7+ years of overall IT experience in roles spanning Development, Administration, DevOps and Automation, including 4 years in Big Data environments.
  • 2+ years of DevOps experience in Build process, Process Automation, Release Management, Source Code repository management, Hadoop and Linux Administration.
  • Experience in automation and configuration management using Puppet and Chef.
  • Involved in all phases of the Software Development Life Cycle (SDLC) and worked on all activities related to the development, implementation, administration and support of ETL processes for large-scale data warehouses.
  • Experience integrating Application/Service Monitoring tools such as Nagios, Ganglia, Zabbix with Big Data Frameworks.
  • Partnered with development teams to prepare for timely and smooth acceptance of deliverables into the production environment.
  • Coordinated with Quality Engineering teams on early and frequent performance testing, release packaging, documentation and deployment steps.
  • Experience in deploying and managing Big Data Frameworks on Public and private cloud environments and in building out hybrid designs leveraging on-premise and public cloud environments.
  • Skilled at working with AWS services including EC2, VPC, S3, RDS, EBS, EMR, CloudFormation, Direct Connect, CloudWatch and others.
  • Experience in building out scalable solutions leveraging AWS components such as CloudWatch and Auto Scaling.
  • Experience in administrating and managing NoSQL systems such as HBase and Cassandra at scale.
  • Experience in implementing Security for various Big Data Frameworks leveraging tools/frameworks such as Kerberos (Authentication), Ranger/Sentry (Authorization), HDFS TDE/Navigator (Encryption), Knox (Perimeter Access Control)
  • Experience in designing and implementing Disaster Recovery strategies for Hadoop and various NoSQL frameworks.
  • Extensively worked on administering various Linux distros such as Red Hat, CentOS and Ubuntu.
  • Worked in 24x7 on-call support environments, managing escalations to Emergency Response Teams, driving Root Cause Analysis, and participating in weekly operations reviews.
  • Experience in understanding clients’ Big Data business requirements and translating them into Hadoop-centric solutions.
  • Analyzed clients’ existing Hadoop infrastructure to identify performance bottlenecks and tuned the environment accordingly.
  • Experience in Data migration from existing data stores to Hadoop.
  • Experience in developing capacity plans for new and existing Hadoop systems.
  • Experience in designing and implementing complete end-to-end Hadoop Infrastructure.
  • Experience in understanding customers’ multiple data sets, including behavioral data, customer profile data, sales data and product data.
  • Experience in importing and exporting data between databases such as MySQL and Oracle and HDFS/Hive using Sqoop (a minimal sketch follows this summary).
  • Developed MapReduce programs to perform data transformation and analysis.
  • Experience in analyzing data with Hive and Pig using a schema-on-read approach.
  • Defined job flows in Hadoop environments using tools like Oozie for data scrubbing and processing.
  • Experience in Cluster coordination services through Zookeeper.
  • Loading logs from multiple sources directly into HDFS using Flume.
  • Knowledge of implementing the Fair and Capacity Schedulers.
  • Experience in handling XML data and parsing it with MapReduce jobs.
  • Experience in setting up monitoring tools like Nagios and Ganglia for Hadoop.
  • Excellent verbal and written communication skills combined with interpersonal and conflict resolution skills and possess strong analytical skills.
  • Exceptional ability to quickly master new concepts and technologies.
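
A minimal sketch of the kind of Sqoop import referenced above; the host, database, table and target paths are hypothetical placeholders rather than values from any specific engagement.

    #!/bin/bash
    # Import a MySQL table into HDFS, then into a Hive table (hypothetical host/db/table names).
    sqoop import \
      --connect jdbc:mysql://dbhost.example.com:3306/salesdb \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

    # The same table loaded directly into a Hive table instead of a raw HDFS directory.
    sqoop import \
      --connect jdbc:mysql://dbhost.example.com:3306/salesdb \
      --username etl_user -P \
      --table orders \
      --hive-import --hive-table analytics.orders \
      --num-mappers 4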

Education

  • Master’s in Computer Science Engineering, Towson University, Maryland.
  • Bachelor’s in Computer Science, JNTU.

Technical Skills

Hadoop Ecosystem
HDFS, YARN, Hive, Pig, Spark, Impala, Sqoop, Flume, Kafka, Zookeeper, Oozie, Cloudera Manager, Apache Ambari
NoSQL
Cassandra, HBase
Relational Database Systems
MySQL, PostgreSQL, Oracle, SQL Server 2005/2008
Languages
C, Java, SQL, Pig Latin, UNIX Shell Scripting, Ruby, Python
Automation Frameworks
Puppet, Chef
Cloud Environments
Amazon Web Services, Microsoft Azure, OpenStack
Virtualization
VMWare vSphere, Docker
Security
MIT Kerberos, LDAP, AD, Sentry, Ranger, Knox
Continuous Integration
Jenkins, Bamboo
VCS (Version Control Systems)
Git, SVN
Monitoring and Alerting
Ganglia, Zabbix, Nagios, OpenTSDB
Operating Systems
RedHat, CentOS, Ubuntu, Mac OSX, Windows Server 2003/2008/2012, Windows

Certifications

  • CCAH Certification (Cloudera Certified Administrator for Apache Hadoop).
  • AWS Certified Solutions Architect – Associate Level

Professional Experience

Autodesk Inc. San Rafael, CA
Duration
Jan 2016 – Present
Role
Hadoop DevOps Engineer
Responsibilities
The Enterprise Cloud Consumption reporting project at Autodesk Inc. migrates the product and cloud-services usage data collected by various systems at Autodesk from the legacy End User Reporting Platform (EURP) to the new Enterprise Cloud Consumption reporting platform. The goal is better, faster and more efficient access to usage-trend analysis and to sales-potential analysis for products and cloud services offered to new and existing Autodesk clients. Reports from the new architecture are available to both internal and external customers.

  • Primarily responsible for deployments, configuration management and server upgrades of the client's network.
  • Responsible for capacity planning and demand forecast projections.
  • Leveraged Puppet effectively to propagate configuration and service daemon changes throughout the cluster.
  • Wrote custom shell and Ruby scripts used by the Network Operations team to control cluster service daemons (an illustrative sketch follows this list).
  • Responsible for managing large-scale Hadoop deployments on AWS leveraging components such as EC2, VPC, S3, EMR, RDS, Auto Scaling, CloudWatch, SNS and others.
  • Worked on a pilot project of redirecting data onto Amazon Web Services Cluster using AWS Elastic Load Balancers and Auto-scaling features.
  • Created and maintained Amazon Virtual Private Cloud (VPC) resources such as subnets, network access control lists, and security groups.
  • Identified and resolved issues in the cluster by performing root cause analysis and effectively coordinated with cross-functional teams in the process.
  • Performance-tuned the existing clusters and scaled their capacity to cater to increased application traffic, performing application load balancing accordingly.
  • Responsible for setting up and maintaining a CI pipeline that monitors the Git repositories and builds and deploys code to pre-production environments based on changes published to master branches.
  • Perform ongoing capacity management forecasts including timing and budget considerations.
  • Developed custom queries for enhancing the Tableau data model and heat map.
  • Created the subdoc for the continuous delivery system.
  • Used GitHub for version control.
  • Developed automation scripts for the reconciliation process.
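
A simplified sketch of the kind of shell wrapper handed to the Network Operations team for controlling cluster service daemons; the host list, SSH user and service names are hypothetical, and the real scripts carried more validation and logging.

    #!/bin/bash
    # Start/stop/restart/status a cluster service daemon across all worker nodes.
    # Hypothetical usage: ./cluster_daemon.sh restart hadoop-hdfs-datanode
    ACTION="$1"
    SERVICE="$2"
    HOSTS_FILE="/etc/cluster/worker_hosts"   # one hostname per line (assumed path)

    if [[ -z "$ACTION" || -z "$SERVICE" ]]; then
      echo "Usage: $0 <start|stop|restart|status> <service>" >&2
      exit 1
    fi

    while read -r host; do
      echo "== ${host}: ${ACTION} ${SERVICE} =="
      ssh -o ConnectTimeout=10 "ops@${host}" "sudo service ${SERVICE} ${ACTION}" \
        || echo "WARN: ${ACTION} failed on ${host}" >&2
    done < "$HOSTS_FILE"
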
Visa Inc, Foster City, CA
Duration
Jan 2015 - Dec 2015
Role
Hadoop Consultant
Responsibilities
Visa Inc. is an American multinational financial services corporation headquartered in Foster City, California. The main objective of the project I worked on was to increase Visa merchant partners’ digital advertising ROI by helping them optimize their online advertising spend through offline sales attribution insights. This was achieved by strategically partnering with key players across the digital advertising ecosystem, including some of the leading digital search and display advertising companies that Visa merchants were already aligned with.

  • Automate the provisioning, deployment, scaling and monitoring of the Hadoop platform across multiple environments.
  • Resolve complex technical issues and drive innovations that improve system availability, resilience and performance
  • Used Cloudera Director to automate Hadoop installation on private OpenStack environments for Test, Stage and Dev environments
  • Design and implement build pipelines leveraging Jenkins, Git and Autosys (an illustrative sketch of one pipeline step follows this list).
  • Worked in a hybrid cloud environment using OpenStack and coordinated in the migration of the existing cluster’s data onto AWS environment.
  • Installed and configured a QA datacenter cluster using OpenStack.
  • Part of support team responsible for supporting applications and users on various environments.
  • Responsible for managing and tuning a 200-node Impala deployment.
  • Part of a five-person on-call support team responsible for maintaining the uptime of the Big Data environments in production and other environments.
  • Demonstrated experience in architecture, engineering and implementation of enterprise-grade production big data use cases.
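
An illustrative sketch of the kind of shell step a Jenkins pipeline job might invoke to package and stage workflow code to a test cluster; the repository layout, artifact name, target host and paths are hypothetical.

    #!/bin/bash
    # Jenkins-invoked step: package ETL workflow code and stage it on a test edge node.
    set -e
    GIT_SHA=$(git rev-parse --short HEAD)
    ARTIFACT="etl-workflows-${GIT_SHA}.tar.gz"

    tar -czf "${ARTIFACT}" workflows/ scripts/ conf/
    scp "${ARTIFACT}" deploy@edge-test.example.com:/opt/releases/
    ssh deploy@edge-test.example.com "
      mkdir -p /opt/etl/current &&
      tar -xzf /opt/releases/${ARTIFACT} -C /opt/etl/current &&
      hdfs dfs -put -f /opt/etl/current/workflows /user/etl/workflows
    "
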
Development and ETL Design in Hadoop
  • Developed MapReduce Input format to read visa specific data format.
  • Performance tuning of Hive queries written by data analysts (an illustrative example follows this list).
  • Developing Hive queries and UDFs as per requirements.
  • Migrating existing Ab Initio transformation logic to Hadoop Pig Latin and UDFs.
  • Used Sqoop to efficiently transfer data from DB2 and Oracle Exadata to HDFS.
  • Designing ETL flows for several newly onboarding Hadoop applications.
  • Implemented NLineInputFormat to split a single file into multiple small files.
  • Designed and Developed Oozie workflows, integration with HCatalog/Pig.
  • Documented ETL Best Practices to be implemented with Hadoop.
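
An illustrative example of the kind of session-level settings applied while tuning analysts’ Hive queries; the table, column and partition values are hypothetical and the exact settings varied by workload.

    # Run an analyst query with common tuning settings applied (hypothetical schema).
    hive -e "
      SET hive.exec.parallel=true;            -- run independent stages in parallel
      SET hive.auto.convert.join=true;        -- convert joins against small tables to map joins
      SET mapreduce.job.reduces=32;           -- cap reducer count for this query
      SELECT merchant_id, SUM(txn_amount) AS total_spend
      FROM   txn_summary
      WHERE  ds = '2015-06-01'                -- prune to a single date partition
      GROUP BY merchant_id;
    "
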
Bank of America, Charlotte, NC
Duration
Jan 2013-Dec 2014
Role
Hadoop Engineer
Responsibilities
The Bank of America Corporation is an American multinational banking and financial services corporation headquartered in Charlotte, North Carolina. It is the second-largest bank holding company in the United States by assets. The main goal of this project is to process all Auto Loans data in a single unified location and to replace the existing legacy process after more than a year of testing. Data is imported into HDFS from Mainframes and Teradata, transformations are applied to create a single merged and flattened view, and the data is then exported to Netezza for downstream consumption.

  • Manage the day-to-day operations of the cluster for backup and support.
  • Responsible for scaling the cluster and demand forecasting to meet the business requirements and scale applications with the growth of the data.
  • Debug complex service issues including service incidents, complex customer setups, field trials, performance issues.
  • Plan and execute on system upgrades for existing Hadoop clusters.
  • Participate in development/implementation of Cloudera Hadoop environment.
  • Manage the backup and disaster recovery for Hadoop data.
  • Performance analysis and debugging of slow running development and production processes.
  • Assist with development and maintain the system documentation.
  • Create and publish various production metrics including system performance and reliability information to systems owners and management.
  • Coordinate root cause analysis (RCA) efforts to minimize future system issues.
  • Prepare and maintain runbooks for L1, L2 teams.
  • Developed MapReduce programs for parsing mainframe EBCDIC data.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters.
  • Investigated Sqoop and Avro issues in CDH4 across the different traditional databases.
  • Developed shell scripts for reconciliation and for kicking off Oozie workflows (a condensed sketch follows this list).
  • End-to-end implementation with Avro and Snappy compression.
  • Exported data to Netezza and fixed issues with bad records encountered during the export.
  • Used SVN for version control.
  • Used Ant for deployments to different environments.
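
A condensed sketch of the kind of shell script used to kick off an Oozie workflow and record the job ID for reconciliation; the Oozie URL, properties file and log paths are hypothetical, and the output parsing may need adjusting for the Oozie version in use.

    #!/bin/bash
    # Submit an Oozie workflow and wait for it to leave the RUNNING state (hypothetical paths).
    OOZIE_URL="http://oozie.example.com:11000/oozie"
    JOB_PROPS="/opt/etl/auto_loans/job.properties"

    JOB_ID=$(oozie job -oozie "$OOZIE_URL" -config "$JOB_PROPS" -run | awk -F': ' '{print $2}')
    echo "$(date +%F_%T) submitted ${JOB_ID}" >> /var/log/etl/oozie_submissions.log

    while true; do
      STATUS=$(oozie job -oozie "$OOZIE_URL" -info "$JOB_ID" | awk '/^Status/ {print $3}')
      [[ "$STATUS" != "RUNNING" ]] && break
      sleep 60
    done
    echo "Workflow ${JOB_ID} finished with status ${STATUS}"
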
Wal-Mart, Bentonville, AR
Duration
Mar 2012-Dec 2012
Role
Hadoop Developer
Responsibilities
Worked on the Customer Knowledge Project (CKP). The main goal of the project is to determine the number of customers enrolled with Wal-Mart and Sam’s. Data from the Wal-Mart and Sam’s online and brick-and-mortar channels, held in data sources such as Teradata and DB2 servers, is imported into HDFS using Sqoop. After the full data set is loaded, each customer is assigned a unique identifier and the data is joined with Experian PII data. High-level analytics are then performed on each customer’s data.

  • Analyzed requirements from the customers and participated in Agile (Scrum) ceremonies for daily updates and further development.
  • Involved in a Hadoop proof of concept converting the existing ETL, which leveraged Teradata computation and storage, onto Hadoop.
  • Involved in the process to test the cluster performance with different benchmark techniques.
  • Worked on building a data pipeline to bring data from Teradata, DB2 Systems to Hadoop Environment using Sqoop.
  • Worked on writing Hive queries for data analysis to meet the business requirements (an illustrative example follows this list).
  • Worked on writing custom map-reduce programs.
  • Pulled mainframe data using NDM and FTP.
  • Implemented a Hadoop float equivalent of the Teradata Decimal type.
  • Responsible for running Oozie workflow jobs with actions that execute MapReduce, Pig, Hive, Sqoop and sub-workflow jobs.
  • Worked on importing and exporting data between databases such as MySQL and Oracle and HDFS/Hive using Sqoop.
  • Used HP Quality Center for defect tracking and bug fixing.
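
An illustrative example of the style of Hive analysis query run for the enrollment counts; the database, table and column names are hypothetical placeholders.

    # Count distinct enrolled customers per banner and channel (hypothetical schema).
    hive -e "
      SELECT banner,                                   -- e.g. walmart or sams
             channel,                                  -- e.g. online or store
             COUNT(DISTINCT customer_uid) AS enrolled_customers
      FROM   ckp.customer_enrollment
      GROUP BY banner, channel;
    "
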
InComm, Atlanta, GA
Duration
Sep 2011-Feb 2012
Role
Hadoop Engineer
Responsibilities
InComm is an industry-leading marketer, distributor and technology innovator of stored-value gift and prepaid products. A large volume of data from customer transactions and other networks was streamed daily to an Oracle database platform, from which the data was loaded into the Hadoop cluster. This data was retained for 90 days for analysis purposes, with 600-700 GB loaded into the cluster daily.

  • Worked on setting up the Hadoop cluster for the Environment.
  • Worked on pulling data from Oracle databases into the Hadoop cluster using Sqoop import.
  • Worked with Flume to import log data from reaper logs and syslogs into the Hadoop cluster.
  • Data was pre-processed and fact tables were created using HIVE.
  • The resulting data set was exported to SQL server for further analysis.
  • Generated reports using Tableau report designer.
  • Automated all the jobs, from pulling data from databases to loading data into SQL Server, using shell scripts (a compressed sketch follows this list).
  • Used Ganglia to monitor the cluster around the clock.
  • Supported Data Analysts in running Map Reduce Programs.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop.
  • Worked on analyzing data with Hive and Pig.
  • Installed and configured NFS; used nslookup to verify information in DNS.
  • Analyzed complex distributed production deployments and made recommendations to optimize performance.
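
A compressed sketch of the end-to-end automation described above (Oracle import, Hive preprocessing, SQL Server export); the hosts, credentials files, schemas and table names are hypothetical.

    #!/bin/bash
    set -e   # stop at the first failing step
    RUN_DATE=$(date +%Y-%m-%d)

    # 1. Pull the day's transactions from Oracle into HDFS (hypothetical connection details).
    sqoop import \
      --connect jdbc:oracle:thin:@orahost.example.com:1521/TXNDB \
      --username etl_user --password-file /user/etl/.ora_pass \
      --table TRANSACTIONS \
      --where "TXN_DATE = TO_DATE('${RUN_DATE}', 'YYYY-MM-DD')" \
      --target-dir /data/raw/transactions/${RUN_DATE} \
      --num-mappers 8

    # 2. Build the daily fact partition in Hive from the raw import.
    hive -hiveconf run_date=${RUN_DATE} -f /opt/etl/build_txn_fact.hql

    # 3. Export the aggregated results to SQL Server for reporting.
    sqoop export \
      --connect "jdbc:sqlserver://mssql.example.com:1433;databaseName=Reporting" \
      --username report_user --password-file /user/etl/.mssql_pass \
      --table DAILY_TXN_SUMMARY \
      --export-dir /data/marts/daily_txn_summary/${RUN_DATE}
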
Cloudwick Technologies
Duration
Feb 2011-Aug 2011
Role
Hadoop Engineer
Responsibilities
Cloudwick, a Bay Area startup, provides professional Big Data services and facilitates the extraction of nuggets of information to support data-based decision making.

  • Providing implementation services using pre-configured scripts to get clusters up and running in a matter of days instead of months.
  • Providing integration services with other data sources in the enterprise by effectively using tools in Hadoop ecosystems
  • Motivated to achieve value creation and client satisfaction, supporting each specialist with company-wide knowledge.
  • Developed rapid deployment scripts in Puppet for deploying Hadoop ecosystems; these scripts installed Hadoop, Hive, Pig, Oozie, Flume, ZooKeeper and other ecosystem components, along with monitoring scripts for Nagios and Ganglia.
  • Developed scripts for benchmarking with TeraSort/TeraGen (a minimal sketch follows this list).
  • Worked on multiple POCs focused on importing data into HDFS from relational databases.
  • Developed Hive queries to process data for analysis by imposing a schema-on-read structure on the streamed data.
  • Performed analysis using Datameer. The data was directly accessed from HDFS and reports were populated in Datameer.
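
A minimal sketch of the TeraGen/TeraSort/TeraValidate benchmarking flow; the data size and examples-jar path are placeholders that depend on the distribution and the cluster under test.

    #!/bin/bash
    # Generate, sort and validate ~100 GB of benchmark data (size and jar path are placeholders).
    EXAMPLES_JAR=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
    ROWS=1000000000          # 1e9 rows x 100 bytes per row = ~100 GB
    BASE=/benchmarks/tera

    hadoop fs -rm -r -skipTrash ${BASE} || true

    time hadoop jar ${EXAMPLES_JAR} teragen  ${ROWS} ${BASE}/input
    time hadoop jar ${EXAMPLES_JAR} terasort ${BASE}/input ${BASE}/output
    time hadoop jar ${EXAMPLES_JAR} teravalidate ${BASE}/output ${BASE}/report
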
KSD INFOTECH
Duration
Jun 2008- July 2010
Role
System Administrator
Responsibilities
KSD INFOTECH is a fast growing full service company offering simplified IT Consultation, Application Development, Maintenance, Migration, Testing, IT Infrastructure, Business Intelligence and Training services across varied platforms for Energy and Utilities, Financial and Insurance, Life Sciences and Manufacturing sectors.

  • Authenticated users using LDAP and Active Directory.
  • Backed up folders and directories using shell scripts.
  • Used UNIX tools such as awk and sed for processing logs (a small sketch follows this list).
  • Performed regular preventive maintenance (daily, weekly, etc.), booted and shut down systems when needed, and managed printers, backup media and system performance tuning.
  • Developed automated scripts for performing regular Linux tasks.
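
A small sketch combining the backup and log-processing tasks above; the directory paths and log format are hypothetical (a common-log-format web access log is assumed).

    #!/bin/bash
    # Nightly backup of selected directories (hypothetical paths).
    BACKUP_DIR=/backup/$(date +%Y%m%d)
    mkdir -p "$BACKUP_DIR"
    tar -czf "$BACKUP_DIR/home_etc.tar.gz" /home /etc

    # Summarize server-error responses from an access log using sed and awk;
    # in common log format the HTTP status code is the ninth whitespace-separated field.
    sed 's/"//g' /var/log/httpd/access_log \
      | awk '$9 >= 500 {count[$9]++} END {for (code in count) print code, count[code]}' \
      > "$BACKUP_DIR/error_summary.txt"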