hirak06datascience
About Candidate
Education
Work & Experience
• Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
• Developed pipelines in ADF using Linked services, Datasets, and Pipelines to extract, transform, and load data from sources such as Teradata, Blob storage, and Azure SQL Data Warehouse.
• Built the Oozie pipeline for file movements, Sqoop data transfers from Teradata or SQL, and exports into Hive staging tables with subsequent business-driven aggregations and main table loads.
• Established infrastructure for optimal ETL processes from diverse data sources using SQL and big data technologies like Hadoop Hive and Azure Data Lake Storage.
• Collaborated on ETL tasks to maintain data integrity and pipeline stability.
• Applied data modeling, ETL processes, and data warehousing concepts within Power BI and QlikView environments.
• Designed and implemented database solutions with Azure Blob Storage for data storage and retrieval.
• Deployed data factory pipelines to orchestrate data flows into SQL Databases.
• Maintained data processing solutions using Azure HDInsight, Azure Databricks, and Azure Stream Analytics for real-time and batch data processing.
• Cleansed, manipulated, and analyzed large datasets, including semi-structured and unstructured data (XML, JSON, CSV, PDFs) using Python.
• Developed Python scripts for data filtering, cleansing, mapping, and aggregation.
• Designed and developed ETL workflows in Informatica, leveraging Informatica Designer components to build complex mappings, sessions, and workflows for an Enterprise Data Warehouse.
• Created informative dashboards and data visualizations in Power BI, enabling business users to make data-driven decisions.
• Employed BigQuery to manage and analyze large datasets, optimizing queries for efficient and scalable data processing.
• Translated business requirements into secure and maintainable code, aligning technical solutions with business objectives.
• Conducted extensive data analysis and root cause analysis to troubleshoot issues and improve data quality.
• Reviewed and streamlined business processes, translating them into effective BI reporting and analysis solutions to enhance operational efficiency.
• Adhered to SDLC processes and various project management methodologies to ensure timely and successful project delivery.
• Utilized data modeling techniques (star/snowflake schema design, data marts, slowly changing dimensions) to create physical and logical data models for data warehouses.
• Identified and resolved BI application performance bottlenecks through thorough analysis and tuning, enhancing overall system performance.
• Led various phases of SDLC from requirement gathering to testing, delivering scalable data solutions for healthcare analytics.
• Designed and implemented a streaming platform using AMQ-Streams, Kafka, Camel, and Spring, enhancing real-time data processing capabilities.
• Utilized Sqoop to load data efficiently into Spark SQL, creating RDDs, Datasets, and DataFrames.
• Implemented data formats such as Avro, Parquet, ORC, and JSON, and developed UDFs in Hive and Pig for advanced data processing.
• Managed Change Data Capture (CDC) using Qlik Replicate and automated data loading into HDFS.
• Developed Airflow scheduling scripts in Python for orchestrating complex data workflows.
• Led end-to-end machine learning workflows, from data gathering and preprocessing to model evaluation and deployment using Azure and Snowflake data sources.
• Implemented cloud computing solutions with HDInsight, Azure Data Lake, Azure Data Factory, and Azure Machine Learning, leveraging PowerShell scripting.
• Developed Spark scripts in PySpark for data processing and analytics.
• Built applications using Django and Flask frameworks, integrating REST APIs and leveraging dependency injection with Spring Framework.
• Designed and implemented data pipelines using Oozie and Airflow for efficient data processing and workflow automation.
• Utilized Kafka Streams for real-time data streaming and analytics, enhancing decision-making capabilities.
• Developed data-driven stories and advanced data analyses using QlikView and Power BI, creating intuitive visualizations for business users.
• Deployed and configured Power BI in cloud and on-premises environments, ensuring data security and automatic report refresh.
• Implemented Snowflake and Data Vault modeling approaches for managing large datasets and metadata.
• Optimized data processing using Spark's in-memory capabilities, efficient joins, and transformations.
• Implemented machine learning algorithms including linear regression, logistic regression, decision trees, random forests, and XGBoost for predictive analytics.
• Developed machine learning models using recurrent neural networks (LSTM) for time series analysis and predictive modeling.
• Automated processes using Ansible Python API and managed collections in Python for efficient data manipulation.
• Integrated Workday HCM modules and managed data pipelines using Fivetran and HVR for data replication into Snowflake.
Delivered a 6% cost saving for a component-manufacturing giant by designing an end-to-end solution spanning cloud migration and predictive model development.
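The Python data filtering, cleansing, mapping, and aggregation work described above can be sketched as follows. This is a minimal illustrative example, not code from the actual engagements: the record shape, field names, and sample data are all assumptions.

```python
import json
from collections import defaultdict

# Hypothetical semi-structured input: one JSON record per line
RAW_RECORDS = [
    '{"region": "EMEA", "amount": "120.50", "status": "ok"}',
    '{"region": "EMEA", "amount": "bad", "status": "ok"}',       # malformed amount
    '{"region": "APAC", "amount": "75.00", "status": "error"}',  # filtered out
    '{"region": "APAC", "amount": "30.25", "status": "ok"}',
]

def cleanse_and_aggregate(lines):
    """Filter, cleanse, and aggregate JSON records: keep status=ok rows,
    coerce amounts to float, and sum per region."""
    totals = defaultdict(float)
    for line in lines:
        try:
            rec = json.loads(line)
            if rec.get("status") != "ok":
                continue                   # filtering step
            amount = float(rec["amount"])  # cleansing/coercion step
        except (ValueError, KeyError):
            continue                       # drop malformed records
        totals[rec["region"]] += amount    # aggregation step
    return dict(totals)
```

The same filter/coerce/aggregate pattern scales up naturally to PySpark DataFrames for the larger datasets mentioned above.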
Hadoop Ecosystem and Data Integration:
• Utilized the full Hadoop ecosystem (HDFS, YARN, MapReduce, Hive, Flume, Oozie, Zookeeper, Impala, HBase, Sqoop) through Cloudera Manager.
• Collected and aggregated large volumes of log data using Apache Flume, staging it in HDFS for further analysis.
• Created core Python API used across multiple modules for seamless integration.
• Implemented data ingestion and managed clusters for real-time processing using Kafka.
• Set up multi-hop, fan-in, and fan-out workflows in Flume to streamline data processing.
• Imported transactional logs from web servers into HDFS using Flume.
• Developed custom serializers and interceptors in Flume to mask confidential data and filter unwanted records from event payloads.
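Production Flume interceptors and serializers are implemented in Java against the Flume API; the masking and filtering logic they apply to event bodies can be sketched in Python. The SSN-style PII pattern and the DEBUG-record filter here are illustrative assumptions:

```python
import re

# Assumed PII shape: US SSN-style identifiers embedded in log lines
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def intercept(event_body):
    """Mimic a Flume interceptor chain: drop unwanted events,
    mask confidential fields in the ones that pass."""
    if not event_body.strip():
        return None  # filter: drop empty events
    if "DEBUG" in event_body:
        return None  # filter: drop noise records (hypothetical rule)
    return SSN_PATTERN.sub("***-**-****", event_body)  # mask PII
```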
Data Analytics and Machine Learning:
• Designed and implemented data analytics solutions in the Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
• Developed parser and loader MapReduce applications to extract data from HDFS and store it in HBase and Hive.
• Wrote user-defined functions (UDFs) in Hadoop PySpark for data transformations and loads.
• Built various machine learning models (Logistic Regression, KNN, Gradient Boosting) using Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn in Python.
• Experimented with ensemble methods to enhance model accuracy, deploying models on AWS.
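Of the models listed above, KNN is the simplest to sketch without library dependencies. This is a pure-Python illustration of the classification idea (in practice Scikit-learn's `KNeighborsClassifier` would be used, as the bullets state); the training data format is an assumption:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance.

    `train` is a list of (feature_tuple, label) pairs."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```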
Database Management and Development:
• Extracted data from Teradata into HDFS/Dashboards using Spark Streaming.
• Worked on MongoDB database concepts including locking, transactions, indexes, sharding, replication, and schema design.
• Automated RabbitMQ cluster installations and configurations using Python/Bash.
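Automating RabbitMQ cluster configuration typically involves rendering a `rabbitmq.conf` per node before joining it to the cluster. A minimal Python sketch of that rendering step, using RabbitMQ's classic-config peer discovery keys (the node and cluster names are placeholders):

```python
def render_rabbitmq_conf(nodes, cluster_name="prod-cluster"):
    """Render a minimal rabbitmq.conf for a cluster that discovers
    peers via classic config; hostnames are illustrative."""
    lines = [
        f"cluster_name = {cluster_name}",
        "cluster_formation.peer_discovery_backend = classic_config",
    ]
    for i, node in enumerate(nodes, start=1):
        # One entry per peer node, numbered from 1
        lines.append(f"cluster_formation.classic_config.nodes.{i} = rabbit@{node}")
    return "\n".join(lines)
```

In a real automation flow this output would be written to each host and the node restarted, e.g. via Ansible or a Bash wrapper as described above.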
Workflow Automation and Batch Processing:
• Utilized Oozie for batch processing and dynamically scheduling workflows.
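Dynamically scheduling Oozie workflows usually means generating coordinator definitions programmatically rather than hand-writing XML. A hedged sketch of that generation step in Python (the app name, frequency expression, and HDFS path are placeholders, not values from the original projects):

```python
import xml.etree.ElementTree as ET

def build_coordinator(name, frequency, start, end, app_path):
    """Build a minimal Oozie coordinator definition that triggers a
    workflow on a fixed frequency."""
    coord = ET.Element("coordinator-app", {
        "name": name,
        "frequency": frequency,
        "start": start,
        "end": end,
        "timezone": "UTC",
        "xmlns": "uri:oozie:coordinator:0.4",
    })
    action = ET.SubElement(coord, "action")
    workflow = ET.SubElement(action, "workflow")
    ET.SubElement(workflow, "app-path").text = app_path
    return ET.tostring(coord, encoding="unicode")

xml_def = build_coordinator(
    "daily-load",                     # placeholder coordinator name
    "${coord:days(1)}",               # run once per day
    "2024-01-01T00:00Z",
    "2024-12-31T00:00Z",
    "hdfs:///apps/etl/workflow.xml",  # placeholder workflow path
)
```

The generated string would then be uploaded to HDFS and submitted with the `oozie job` CLI.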