My Projects

Finding Fraudsters

Credit card fraud is a major problem in e-commerce, and a difficult one to eliminate. This project uses the dataset from the IEEE-CIS Fraud Detection Kaggle competition. I used correlation heatmaps and feature engineering to build the model that achieved my best AUC score.

  • Developed a fraud detection model using boosted decision trees (BDT).
  • Dealt with a highly imbalanced dataset of 590,540 transactions, of which only 3.5% were fraudulent.
  • Achieved a ROC AUC of 0.93 using an ensemble of XGBoost, LightGBM, and CatBoost.
  • Model deployable in real-time transaction screening to automatically flag suspicious activity, reducing financial losses and minimizing manual review workload for fraud analysts.
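
For readers less familiar with the metric, here is a minimal sketch in plain Python of how ROC AUC can be computed for an averaged ensemble. It is not the competition code: the function names `roc_auc` and `average_scores`, the simple score averaging, and the three score lists are assumptions made up for illustration. ROC AUC suits imbalanced data like this because it measures how well frauds are ranked above legitimate transactions, regardless of how rare fraud is.

```python
def roc_auc(labels, scores):
    """ROC AUC as the chance a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_scores(*model_scores):
    """Blend several models by averaging their scores row by row."""
    return [sum(row) / len(row) for row in zip(*model_scores)]


# Made-up scores for eight transactions (1 = fraud); not project data.
labels = [0, 0, 0, 1, 0, 0, 1, 0]
model_a = [0.10, 0.30, 0.20, 0.90, 0.40, 0.10, 0.35, 0.20]
model_b = [0.20, 0.10, 0.30, 0.70, 0.20, 0.30, 0.60, 0.10]
model_c = [0.15, 0.20, 0.10, 0.80, 0.30, 0.20, 0.55, 0.25]

ensemble = average_scores(model_a, model_b, model_c)
print(f"ensemble ROC AUC: {roc_auc(labels, ensemble):.2f}")
```

The pairwise loop is quadratic in the number of rows, so it is only meant to show the definition; a library routine would be the practical choice on hundreds of thousands of transactions.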

Chicago Crimes

Crime patterns across Chicago's community areas, uncovered through large-scale data analysis and interactive visualization.

  • Analyzed crime data from the Chicago Data Portal, spanning 8 million rows and 22 columns.
  • Used Seaborn and Folium for data visualization and building interactive maps.
  • Developed SARIMA time series models to generate actionable predictions for public safety authorities.
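
SARIMA handles repeating seasonal structure through seasonal differencing, which subtracts from each value the value one season earlier, before the autoregressive and moving-average terms are fitted. The sketch below shows only that one step in plain Python, not a full SARIMA model and not the project code. The function names and the quarterly counts are hypothetical.

```python
def seasonal_difference(series, period):
    """Return y[t] - y[t - period] for every t that has a value one season back."""
    if period <= 0:
        raise ValueError("period must be positive")
    return [series[t] - series[t - period] for t in range(period, len(series))]


def undo_seasonal_difference(seed, diffs, period):
    """Rebuild the original series from its first `period` values and the differences."""
    if len(seed) != period:
        raise ValueError("seed must contain exactly one full season")
    rebuilt = list(seed)
    for d in diffs:
        # The value one season back is always `period` positions from the end.
        rebuilt.append(d + rebuilt[-period])
    return rebuilt


# Made-up quarterly incident counts over two years (period = 4).
counts = [10, 20, 30, 40, 12, 22, 32, 42]
diffs = seasonal_difference(counts, period=4)
restored = undo_seasonal_difference(counts[:4], diffs, period=4)
print(diffs)
print(restored == counts)
```

After differencing, the repeating quarterly pattern is gone and only the year-over-year change remains, which is the part the rest of the model has to explain.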

ICU Mortality Prediction

A clinical machine learning system trained on the MIMIC-III dataset to predict ICU patient mortality, supporting better resource allocation and patient outcomes.

  • Cleaned and processed the MIMIC-III dataset: 2+ million rows of hourly time series data covering 38,597 patients and 49,785 hospital admissions.
  • Built and compared classification models including Logistic Regression, Random Forest, XGBoost, LSTM, and ClinicalBERT across tabular and time-series use cases.
  • Achieved AUC of 0.83; findings demonstrate how ML-driven mortality prediction can reduce ICU burden and improve resource allocation.

Sepsis Prediction in Patients

An early-warning ML model for sepsis detection in ICU patients, designed to give clinicians a critical time advantage before the condition escalates.

  • Cleaned the PhysioNet dataset for early sepsis prediction, consisting of high-resolution EHR data with 1.5 million rows.
  • Developed XGBoost classification model achieving AUC of 0.95; applied correlation analysis and feature engineering to EHR data.
  • Early detection model enables clinical intervention before escalation, potentially reducing sepsis mortality rates.

Sentiment Analysis and Topic Modelling

An NLP pipeline for discovering and tracking evolving public narratives around Artificial Intelligence across a large corpus of media coverage.

  • Built a topic modeling pipeline applied to 10,000 New York Times articles about Artificial Intelligence.
  • Used transformer-based tools, including BERT and BERTopic, to discover and visualize latent themes across the article corpus.
  • Pipeline enables media organizations or policy teams to automatically track how AI narratives evolve over time.

Predicting Employee Attrition

A predictive HR analytics tool that identifies employees at flight risk, enabling targeted retention efforts before costly turnover occurs.

  • Built supervised models (Logistic Regression, Random Forest, XGBoost) and unsupervised clustering (K-Means) on the IBM HR dataset.
  • Used scaling, one-hot encoding, and clustering to uncover key attrition drivers including income, satisfaction, and tenure.
  • Delivers an early warning system giving HR leaders actionable insight to intervene before losing key employees.
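
The preprocessing named above can be sketched in plain Python. K-Means groups rows by distance, so categorical columns need a numeric encoding and numeric columns need a comparable scale. This is an illustration under assumptions, not the project code: the helper names `one_hot` and `standardize` and the four employee records are made up, and the real dataset's column names are not used.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode each category as a 0/1 row, with columns in sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical employee records, for illustration only.
departments = ["Sales", "Research", "Sales", "HR"]
monthly_income = [3000.0, 5200.0, 4100.0, 2900.0]

categories, dept_rows = one_hot(departments)
income_scaled = standardize(monthly_income)

# One feature row per employee: department indicators, then scaled income.
features = [row + [z] for row, z in zip(dept_rows, income_scaled)]
print(categories)
print(features[0])
```

Without the scaling step, a column measured in thousands (income) would dominate the distance calculation over the 0/1 indicator columns.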

About Me

Physics graduate with 3+ years of experience applying machine learning and quantitative methods to solve complex problems. Fast learner with exceptional attention to detail and a commitment to delivering high-quality results. Seeking opportunities in data science and quantitative analysis to tackle real-world challenges with data-driven solutions.

My Resume