Projects

INCOME CLASSIFICATION

A machine learning project predicting whether an individual's income exceeds $50K using Census data. We implemented models from scratch (Logistic Regression, Decision Tree, Naive Bayes, Neural Network) to analyze socio-economic determinants, with Naive Bayes achieving 79% accuracy. The results offer actionable insights for government resource and policy planning.
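
A minimal sketch of what one of the from-scratch models might look like; the Gaussian Naive Bayes below assumes numeric feature arrays and is illustrative rather than the project's exact code:

```python
import numpy as np

class GaussianNaiveBayes:
    """From-scratch Gaussian Naive Bayes; X is an (n, d) float array,
    y an (n,) label array (names and details are illustrative)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {c: np.mean(y == c) for c in self.classes}
        self.means = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.vars = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes}
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # Log-likelihood under per-feature Gaussians plus log prior.
            ll = -0.5 * np.sum(
                np.log(2 * np.pi * self.vars[c])
                + (X - self.means[c]) ** 2 / self.vars[c],
                axis=1,
            )
            scores.append(ll + np.log(self.priors[c]))
        return self.classes[np.argmax(scores, axis=0)]

# Hypothetical usage on numeric Census features:
# model = GaussianNaiveBayes().fit(X_train, y_train)
# accuracy = (model.predict(X_test) == y_test).mean()
```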

SCALABLE YOUTUBE TREND ANALYSIS & CLOUD DATA PIPELINE

Designed and implemented a robust, serverless data management and analysis system to explore patterns and preferences in global YouTube video popularity. The pipeline uses AWS S3, IAM, Glue, and Lambda to securely integrate and transform large, diverse datasets. Key technical steps include Lambda-based serverless conversion of JSON to Parquet, enabling high-performance analysis and querying via AWS Athena (SQL) for deeper audience insights.
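
A sketch of what the Lambda conversion step could look like, assuming the aws-sdk-pandas (awswrangler) layer is attached and using hypothetical bucket names:

```python
import urllib.parse
import awswrangler as wr  # aws-sdk-pandas, bundled via a Lambda layer

CLEAN_BUCKET = "yt-trends-clean"  # hypothetical output bucket

def handler(event, context):
    """Triggered by S3 PUT events; converts a raw JSON object to Parquet."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the raw JSON and write it back as Parquet, the columnar format
    # Athena scans far more cheaply than JSON. Assumes line-delimited JSON;
    # deeply nested payloads would need json_normalize first.
    df = wr.s3.read_json(path=f"s3://{bucket}/{key}", lines=True)
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{CLEAN_BUCKET}/{key.rsplit('.', 1)[0]}.parquet",
    )
    return {"rows": len(df)}
```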

E-COMMERCE DATABASE

The goal of this project was to establish a scalable and reliable database foundation for a core e-commerce platform. By applying best practices in database design and normalization, the system ensures data integrity and minimizes redundancy. Key technical considerations included designing appropriate entity relationships, implementing proper indexing, and writing efficient queries to maximize system performance and support core e-commerce functionalities (e.g., product catalog, orders, and users).
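
The core of the design can be illustrated with a compact DDL sketch; SQLite stands in for the production engine here, and table and column names are assumptions:

```python
import sqlite3

# In-memory stand-in for the production database; the DDL mirrors the
# normalized entity relationships described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE products (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0)
    );
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL REFERENCES users(user_id),
        placed_at TEXT NOT NULL
    );
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL REFERENCES orders(order_id),
        product_id INTEGER NOT NULL REFERENCES products(product_id),
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        PRIMARY KEY (order_id, product_id)
    );
    -- Index the foreign key that order-history lookups filter on.
    CREATE INDEX idx_orders_user ON orders(user_id);
""")
```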

TECH LAYOFFS ANALYSIS

This comprehensive project analyzed global layoff data (2019–2022) to identify patterns, regional impacts, and key determinants of company susceptibility. By examining industry and company scale trends, the analysis revealed the United States and India as the hardest-hit countries, with the Retail and Transportation sectors (led by Amazon) being most adversely affected. The analysis provides critical insights into the socio-economic impact of market volatility.
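
The country- and industry-level aggregations behind these findings reduce to a few grouped sums; the sketch below assumes hypothetical column and file names:

```python
import pandas as pd

# Hypothetical layoff records; real column names may differ.
df = pd.read_csv("layoffs_2019_2022.csv")

# Aggregate total layoffs by country and by industry to surface
# the hardest-hit segments.
by_country = (
    df.groupby("country")["total_laid_off"].sum().sort_values(ascending=False)
)
by_industry = (
    df.groupby("industry")["total_laid_off"].sum().sort_values(ascending=False)
)
print(by_country.head(5))
print(by_industry.head(5))
```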

CUISINE CLASSIFICATION & FOOD FRESHNESS DETECTION

Developed a low-cost IoT device that attaches to a refrigerator to track food items and predict expiry dates. The solution leverages Machine Learning (ML) for item identification and a Raspberry Pi with a camera module for data capture, sending images to an Amazon EC2 server for classification and expiry prediction. Users receive timely reminders via a web application (to-do list functionality), effectively reducing food waste through the integration of AI, IoT, and web programming.
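
The capture-and-upload loop on the Pi might look like the sketch below; the endpoint URL and response shape are assumptions, and OpenCV stands in for whichever camera library the device uses:

```python
import cv2        # camera capture; the device may use picamera instead
import requests

EC2_ENDPOINT = "http://example-ec2-host/classify"  # hypothetical URL

def capture_and_send():
    """Grab one frame from the fridge camera and ship it to the EC2
    server, which runs item classification and expiry prediction."""
    cam = cv2.VideoCapture(0)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("camera capture failed")
    ok, jpeg = cv2.imencode(".jpg", frame)
    resp = requests.post(
        EC2_ENDPOINT,
        files={"image": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
        timeout=10,
    )
    return resp.json()  # e.g. {"item": ..., "expiry_days": ...}
```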

PREDICTION OF PATIENTS’ HOSPITAL STAY DURATION

This project addresses challenges in patient care and hospital resource management by predicting patient Length of Hospital Stay (LOS) at the time of admission. Leveraging health records and advanced ML techniques (analyzing factors like age and illness severity), the model optimizes hospital operations and resource allocation. The Random Forest Classifier achieved superior performance with 99% accuracy, demonstrating a significant capability to enhance patient care and reduce operational costs.
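
A minimal sketch of the Random Forest branch, assuming hypothetical column names in place of the project's actual health-record schema:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical columns; the real admission records will differ.
df = pd.read_csv("admissions.csv")
X = pd.get_dummies(df[["age", "severity_of_illness", "department"]])
y = df["los_bucket"]  # e.g. short / medium / long stay

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```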

HUMAN ACTIVITY MONITORING

This project performs comprehensive Time Series Analysis on accelerometer data from 15 subjects to monitor human activity (walking, running, climbing). The methodology involves transforming time series into Complex Networks using Natural Visibility Graphs (NVG) and Horizontal Visibility Graphs (HVG), followed by in-depth analysis of metrics like Permutation Entropy and Complexity. This approach successfully reveals structural differences in the data, providing detailed, high-resolution insights for Human Activity Recognition (HAR).
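
Permutation entropy, one of the metrics named above, reduces to counting ordinal patterns; this is a generic implementation, not the project's exact code:

```python
import numpy as np
from math import log, factorial
from collections import Counter

def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy of ordinal patterns of length `order` in a series."""
    patterns = Counter()
    for i in range(len(series) - (order - 1) * delay):
        window = series[i : i + order * delay : delay]
        patterns[tuple(np.argsort(window))] += 1
    probs = np.array(list(patterns.values()), dtype=float)
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs))
    return entropy / log(factorial(order)) if normalize else entropy

# A pure trend has a single ordinal pattern, hence zero entropy,
# while white noise approaches the normalized maximum of 1.
print(permutation_entropy(np.arange(100)))          # -> 0.0
print(permutation_entropy(np.random.rand(10_000)))  # -> close to 1.0
```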

DVD DATABASE AND DATA WAREHOUSE

Designed and implemented a scalable DVD Rental Database and Data Warehouse to streamline operations and business insights. Created robust ETL pipelines using Talend to efficiently extract, transform, and integrate data from PostgreSQL. This architecture significantly enhanced data processing efficiency and was leveraged for in-depth Tableau analysis, supporting evolving business needs and future data volume growth.
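
The actual jobs are built in Talend; as a Python stand-in, one extract-and-load step might look like this, with connection strings and warehouse table names assumed:

```python
import pandas as pd
from sqlalchemy import create_engine

# Python stand-in for a Talend job: extract from the operational
# PostgreSQL database, reshape, and load a warehouse fact table.
src = create_engine("postgresql+psycopg2://user:pass@localhost/dvdrental")
dwh = create_engine("postgresql+psycopg2://user:pass@localhost/dvd_dwh")

rentals = pd.read_sql(
    """
    SELECT r.rental_id, r.rental_date, i.film_id, p.amount
    FROM rental r
    JOIN inventory i ON i.inventory_id = r.inventory_id
    JOIN payment   p ON p.rental_id    = r.rental_id
    """,
    src,
)
rentals["rental_date"] = pd.to_datetime(rentals["rental_date"]).dt.date
rentals.to_sql("fact_rental", dwh, if_exists="append", index=False)
```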

KEYWORD NETWORK & WORD FREQUENCY ANALYSIS

This project utilized Natural Language Processing (NLP) and Network Analysis to extract and transform unstructured text data into meaningful, structured insights. Key tasks involved: 1) Analyzing word frequency, distribution, and co-occurrence patterns in Elon Musk’s Twitter data. 2) Applying Keyword Network Analysis to transform textual data into a weighted network, computing metrics like node degree and strength to reveal structural relationships. The resulting statistical analyses and visualizations provide clear insights into complex linguistic and social patterns.
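
The weighted-network step can be sketched with networkx; tokenization and cleaning are omitted, and the toy tweets are placeholders:

```python
import itertools
import networkx as nx

def keyword_network(token_lists):
    """Build a weighted co-occurrence network: nodes are words, an edge's
    weight counts how many texts mention both endpoints."""
    G = nx.Graph()
    for tokens in token_lists:
        for a, b in itertools.combinations(sorted(set(tokens)), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1
            else:
                G.add_edge(a, b, weight=1)
    degree = dict(G.degree())                   # distinct co-occurring words
    strength = dict(G.degree(weight="weight"))  # total co-occurrence count
    return G, degree, strength

# Toy usage with pre-tokenized tweets:
G, deg, stren = keyword_network([["tesla", "ai", "launch"],
                                 ["tesla", "ai"], ["ai", "launch"]])
print(deg["ai"], stren["ai"])  # 2 distinct neighbours, total weight 4
```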

END-TO-END MLOPS PIPELINE: SCALABLE SENTIMENT ANALYSIS

Engineered a robust, end-to-end MLOps pipeline for sentiment analysis on Google Cloud Platform (GCP). The system uses Apache Airflow and Kubernetes for scalable orchestration, automating the processing and validation (TFDV) of 338 million records. Key features include CI/CD integration, real-time monitoring via the ELK stack, and advanced modeling techniques like Snorkel (achieving a 75% accuracy uplift). The pipeline culminates in a RAG-based summarizer (OpenAI/Pinecone) to generate strategic insights, delivered through a Tableau dashboard and Streamlit interface.
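
A skeleton of the Airflow orchestration layer; task bodies, IDs, and the schedule are placeholders rather than the pipeline's actual operators:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_batch(**_):
    ...  # TFDV schema/statistics validation would run here

def label_with_snorkel(**_):
    ...  # weak-supervision labeling functions would run here

with DAG(
    dag_id="sentiment_pipeline",       # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate", python_callable=validate_batch)
    label = PythonOperator(task_id="label", python_callable=label_with_snorkel)
    validate >> label  # validation gates the labeling step
```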

GLOBAL RETAIL SALES INSIGHTS: ADVANCED POWER BI & DATA MODELING

Engineered a comprehensive Business Intelligence (BI) solution to track global retail sales performance. The project involved pre-processing over 100K records using Power Query and designing a normalized star schema for efficient data linkage across sales, customer, and product entities. A highly interactive Power BI dashboard was built using custom DAX measures (including YoY growth and profit ratio) to deliver strategic insights on revenue, profit margin, and regional trends, enabling leaders to make data-driven decisions.
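
The dashboard's measures are written in DAX; as a language-neutral illustration of the same logic, a pandas equivalent (with assumed column names) might read:

```python
import pandas as pd

# Pandas sketch of the YoY-growth and profit-ratio measure logic;
# the production measures are DAX, and column names are assumed.
sales = pd.read_csv("sales.csv", parse_dates=["order_date"])
sales["year"] = sales["order_date"].dt.year

yearly = sales.groupby("year").agg(
    revenue=("revenue", "sum"), profit=("profit", "sum")
)
yearly["profit_ratio"] = yearly["profit"] / yearly["revenue"]
yearly["yoy_growth"] = yearly["revenue"].pct_change()  # YoY revenue growth
print(yearly)
```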

FAKE NEWS CLASSIFICATION USING MACHINE LEARNING

Developed a Fake News Classification system using Python and Machine Learning models. The project involved pre-processing the LIAR dataset and applying TF-IDF vectorization for high-dimensional feature extraction. Four ML models (Logistic Regression, Naive Bayes, SVM, Random Forest) were trained and rigorously optimized using hyper-parameter tuning, with performance assessed using comprehensive metrics (accuracy, precision, recall, F1 score).
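
One model branch of the TF-IDF plus tuning setup might be sketched as below (Logistic Regression shown); the toy statements, labels, and grid values are placeholders for the pre-processed LIAR data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for LIAR statements and binarized labels:
train_statements = ["pants on fire claim", "verified true statement"] * 10
train_labels = [0, 1] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="f1_macro",
    cv=5,
)
grid.fit(train_statements, train_labels)
print(grid.best_params_, grid.best_score_)
```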