Projects

INCOME CLASSIFICATION

In a time of increasing socio-economic disparities and limited government resources, accurately identifying individuals who require assistance is crucial for effective policy making. This project builds a machine learning model that predicts whether an individual's income exceeds $50,000 per year based on demographic and employment-related attributes, using the Census Income dataset from the UCI Machine Learning Repository. Several models were implemented from scratch (without the scikit-learn library), including Logistic Regression, Discrete Naive Bayes, Decision Tree, and a Neural Network, with Naive Bayes achieving the highest accuracy at 79%. The results contribute to a better understanding of the socio-economic determinants of income levels and can assist government agencies in resource allocation.
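
As an illustration of the from-scratch approach (not the project's actual code), a minimal discrete Naive Bayes classifier with Laplace smoothing might look like the following, where each data row is a dictionary of categorical feature values:

```python
# Illustrative sketch of a hand-coded discrete (categorical) Naive Bayes
# classifier; column names and the smoothing value are assumptions.
from collections import defaultdict
import math

class DiscreteNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, X, y):
        # X: list of dicts {feature_name: category}, y: list of class labels
        self.classes = sorted(set(y))
        self.class_counts = {c: 0 for c in self.classes}
        self.feature_counts = defaultdict(lambda: defaultdict(float))
        self.feature_values = defaultdict(set)
        for row, label in zip(X, y):
            self.class_counts[label] += 1
            for feat, val in row.items():
                self.feature_counts[(label, feat)][val] += 1
                self.feature_values[feat].add(val)
        self.total = len(y)

    def predict(self, row):
        best_class, best_logp = None, -math.inf
        for c in self.classes:
            logp = math.log(self.class_counts[c] / self.total)  # log prior
            for feat, val in row.items():
                counts = self.feature_counts[(c, feat)]
                num = counts.get(val, 0.0) + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.feature_values[feat])
                logp += math.log(num / den)  # smoothed log likelihood
            if logp > best_logp:
                best_class, best_logp = c, logp
        return best_class
```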

YOUTUBE DATA ANALYSIS USING AWS

This project addresses the challenges of harnessing insights from the vast and diverse landscape of YouTube data. The aim is to design and implement a robust data management and analysis system that explores patterns, preferences, and variations in video popularity on the platform, and reveals how YouTube trends and audience interactions differ across geographical regions. The AWS services used include S3, IAM, Glue, Lambda, and Athena. The pipeline ingests and stores the dataset in AWS, transforms it in a scalable and secure manner, defines a schema for the data with Glue, and manages permissions and access controls for the different AWS services through IAM. A serverless Lambda function converts the raw JSON data to Parquet, and the data stored in S3 is then queried with SQL via Athena.
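
As a sketch of the serverless transformation step, a Lambda handler along these lines could read a newly landed JSON object and rewrite it as Parquet registered in the Glue catalog for Athena; the bucket names, Glue database, and table below are assumptions, not the project's actual configuration:

```python
# Minimal sketch of an S3-triggered Lambda that converts JSON to Parquet;
# bucket, database, and table names are illustrative assumptions.
import urllib.parse
import awswrangler as wr

def lambda_handler(event, context):
    # Resolve the object that triggered the function
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the raw JSON from the landing bucket into a DataFrame
    df = wr.s3.read_json(path=f"s3://{bucket}/{key}")

    # Write it back as Parquet, registering it in the Glue catalog
    # so Athena can query it with SQL
    wr.s3.to_parquet(
        df=df,
        path="s3://youtube-analytics-cleansed/statistics/",  # assumed bucket
        dataset=True,
        database="youtube_analytics",  # assumed Glue database
        table="raw_statistics",        # assumed table name
        mode="append",
    )
    return {"status": "ok", "rows": len(df)}
```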

E-COMMERCE DATABASE

In a world increasingly driven by technology, e-commerce has become a cornerstone of modern shopping experiences. The primary objective of this project is to establish a scalable and reliable database that supports the core functionalities of an e-commerce platform. By carefully designing the database schema and implementing appropriate relationships between entities, we ensure efficient data storage, retrieval, and manipulation, ultimately enhancing the overall performance of the e-commerce system. Throughout the project, best practices for database design and normalization are followed to ensure data integrity, minimize redundancy, and optimize query performance. Proper indexing, efficient querying, and thoughtful modeling of data relationships were the key considerations in the design process.
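
For illustration only, a small normalized schema in the spirit of the project could look like the following, expressed with SQLite so the snippet is self-contained; the table and column names are assumptions rather than the project's actual design:

```python
# Sketch of a normalized e-commerce schema with relationships and an index;
# names are illustrative, not the project's schema.
import sqlite3

ddl = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    full_name   TEXT NOT NULL
);

CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    unit_price  NUMERIC NOT NULL CHECK (unit_price >= 0)
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    ordered_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Junction table resolves the many-to-many between orders and products
CREATE TABLE order_items (
    order_id    INTEGER NOT NULL REFERENCES orders(order_id),
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)
);

-- Index the foreign key that order-history lookups filter on
CREATE INDEX idx_orders_customer ON orders(customer_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```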

TECH LAYOFFS ANALYSIS

The COVID-19 pandemic caused unprecedented economic disruption, leading to a significant increase in layoffs across many industries. This project provides a comprehensive analysis of global layoffs between 2019 and 2022, examining their patterns and impacts, identifying the industries and regions most affected, and tracing how the trends evolved over the years. By examining industry trends, company scales, and geographical patterns, the analysis seeks to identify key determinants that increase a company's susceptibility to layoffs. The analysis revealed that the United States and India were the countries hardest hit by layoffs, with the peak layoff period around February 2020. The retail and transportation sectors were the most adversely affected, and Amazon led the list of companies with the highest number of layoffs.
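
The kind of aggregation behind these findings can be sketched in pandas as follows; the file name and column names (country, industry, total_laid_off, date) are assumptions about the dataset layout:

```python
# Sketch of the country, monthly-trend, and industry aggregations;
# file and column names are assumed.
import pandas as pd

df = pd.read_csv("layoffs.csv", parse_dates=["date"])

# Countries with the largest total layoffs
by_country = (df.groupby("country")["total_laid_off"]
                .sum()
                .sort_values(ascending=False)
                .head(10))

# Monthly totals, used to spot the peak layoff period
by_month = (df.set_index("date")["total_laid_off"]
              .resample("MS")
              .sum())

# Industries hit hardest
by_industry = (df.groupby("industry")["total_laid_off"]
                 .sum()
                 .sort_values(ascending=False))

print(by_country, by_month.idxmax(), by_industry.head(3), sep="\n")
```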

CUISINE CLASSIFICATION & FOOD FRESHNESS DETECTION

This project attempts to create a low-cost device that can be easily attached to a refrigerator and keep track of the food items inside it. It uses machine learning to identify each food item and predict its expiry date, and this information is used to remind the user about items that are about to expire. The project draws on concepts from artificial intelligence, the Internet of Things, and web programming. The device consists of a Raspberry Pi connected to a camera, which captures an image of the food item and sends it to an Amazon EC2 server, where the item is classified and its expiry date predicted. The result is then relayed to the client through a web app that functions as a to-do list.
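
A minimal sketch of the device-side capture-and-upload step might look like the following, assuming an OpenCV-readable camera and a hypothetical /classify endpoint on the EC2 server; the URL and response fields are illustrative, not the project's actual API:

```python
# Sketch of the Raspberry Pi client: capture a frame and POST it to the server.
import cv2
import requests

SERVER_URL = "http://ec2-example.compute.amazonaws.com/classify"  # assumed endpoint

def capture_and_send():
    cap = cv2.VideoCapture(0)  # camera attached to the Pi
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("Camera capture failed")

    # Encode the frame as JPEG and upload it for classification
    ok, jpeg = cv2.imencode(".jpg", frame)
    resp = requests.post(
        SERVER_URL,
        files={"image": ("item.jpg", jpeg.tobytes(), "image/jpeg")},
    )
    resp.raise_for_status()

    result = resp.json()  # assumed shape: {"label": ..., "expiry_date": ...}
    print(f"Detected {result['label']}, expires {result['expiry_date']}")

if __name__ == "__main__":
    capture_and_send()
```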

PREDICTION OF PATIENTS’ HOSPITAL STAY DURATION

Healthcare organizations are under increasing pressure to improve patient outcomes and the quality of care. While this situation represents a challenge, it also offers organizations an opportunity to dramatically improve care by extracting more value and insight from their data. This project aims to predict the length of a patient's hospital stay at the time of admission. By leveraging health records, it seeks to enhance hospital resource management, optimize operations, reduce costs, and improve patient care. Using various data mining and machine learning techniques, factors such as age and severity of illness are analyzed. The Random Forest Classifier demonstrated the highest performance, with 99% accuracy, 98% precision, and a 99% F1 score, and is therefore recommended for predicting hospital stay duration. Implementing this model can significantly enhance hospital resource management and patient care.
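
A minimal scikit-learn sketch of the recommended Random Forest setup is shown below; the file name, feature columns, and target column are assumptions about the health-records dataset:

```python
# Sketch of the Random Forest pipeline; dataset layout is assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("hospital_admissions.csv")             # assumed file
X = pd.get_dummies(df[["age", "severity_of_illness",    # assumed features
                       "type_of_admission", "department"]])
y = df["length_of_stay"]                                # assumed categorical stay bands

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# Accuracy, precision, and F1 per class on the held-out set
print(classification_report(y_test, model.predict(X_test)))
```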

HUMAN ACTIVITY MONITORING

This project analyzes time series data from accelerometers to monitor human activity. The data covers activities such as walking, running, climbing up, and climbing down for 15 subjects. We conduct a comprehensive time series analysis, transforming the accelerometer signals into complex networks using natural visibility graphs (NVG) and horizontal visibility graphs (HVG) and analyzing network metrics in depth to reveal structural differences in the data. We also carry out an extensive analysis of permutation entropy and complexity across various parameters, exploring the impact of those parameters and providing detailed insights into the data's randomness and complexity. Data visualization, including scatter plots, illustrates the relationships between metrics, highlighting patterns and trends associated with the different activities.
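
As an illustration of the entropy analysis, a from-scratch permutation-entropy function applied to a synthetic accelerometer-like signal might look like this; the embedding order and delay are example parameters, not the project's chosen values:

```python
# Sketch of normalized permutation entropy over ordinal patterns;
# the input signal here is synthetic, not the project's data.
import numpy as np
from collections import Counter
from math import factorial

def permutation_entropy(signal, order=3, delay=1, normalize=True):
    n = len(signal) - (order - 1) * delay
    # Count the ordinal pattern of each embedded vector
    patterns = Counter(
        tuple(np.argsort(signal[i:i + order * delay:delay])) for i in range(n)
    )
    probs = np.array(list(patterns.values()), dtype=float) / n
    pe = -np.sum(probs * np.log2(probs))
    return pe / np.log2(factorial(order)) if normalize else pe

# A noisy periodic signal should fall between pure noise (PE near 1)
# and a clean periodic signal (PE closer to 0)
t = np.linspace(0, 10, 2000)
walking_like = np.sin(2 * np.pi * 2 * t) + 0.3 * np.random.randn(t.size)
print(permutation_entropy(walking_like, order=4, delay=2))
```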

DVD DATABASE AND DATA WAREHOUSE

The problem at hand involves efficiently managing and extracting valuable insights from a DVD rental database to address various business needs and challenges. The project designs and implements a DVD rental database and data warehouse to manage and organize the different aspects of a DVD rental business. The DVD Rental database is integrated into Talend, with data loaded from PostgreSQL; a data warehouse is then built, and the analysis is performed in Tableau. The implemented ETL pipelines have significantly enhanced data processing efficiency, streamlining the flow of information throughout the system. The architecture is designed with scalability in mind, allowing for future expansion and adaptation to accommodate growing data volumes and evolving business needs.
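
The ETL itself was built in Talend, but the kind of extract-and-transform step it performs can be sketched in Python with pandas and SQLAlchemy; the connection strings and warehouse table name below are assumptions:

```python
# Sketch (not the Talend jobs) of extracting rental facts from the PostgreSQL
# DVD Rental database and loading them into a warehouse fact table.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dvdrental")  # assumed DSN

fact_rental = pd.read_sql(
    """
    SELECT r.rental_id,
           r.rental_date::date AS date_key,
           i.film_id,
           r.customer_id,
           p.amount
    FROM rental r
    JOIN inventory i ON i.inventory_id = r.inventory_id
    LEFT JOIN payment p ON p.rental_id = r.rental_id
    """,
    source,
)

# Load into the warehouse (assumed to be a separate PostgreSQL database)
warehouse = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dvd_dwh")  # assumed DSN
fact_rental.to_sql("fact_rental", warehouse, if_exists="replace", index=False)
```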

KEYWORD NETWORK & WORD FREQUENCY ANALYSIS

The primary aim of this project is to extract meaningful insights by transforming data into structured formats and applying various analytical techniques. The objective of the Keyword Network Analysis task is to extract and transform keyword data into a weighted network and to analyze its structure by computing and visualizing node degree and strength. The objective of the Twitter Data Analysis task is to analyze word frequencies in Elon Musk's tweets and apply network and statistical analyses, demonstrating the distribution and ranking of word frequencies and illustrating the connections between frequently co-occurring words.
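
A minimal networkx sketch of building the weighted keyword co-occurrence network and computing node degree and strength is shown below; the keyword lists are illustrative placeholders, not the project's data:

```python
# Sketch of a weighted keyword co-occurrence network with degree and strength.
from itertools import combinations
import networkx as nx

# Each row is the keyword list of one record (assumed example data)
keyword_lists = [
    ["machine learning", "classification", "python"],
    ["machine learning", "networks", "python"],
    ["networks", "visualization"],
]

G = nx.Graph()
for keywords in keyword_lists:
    for u, v in combinations(sorted(set(keywords)), 2):
        # Edge weight counts how often two keywords appear together
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

degree = dict(G.degree())                   # number of distinct co-keywords
strength = dict(G.degree(weight="weight"))  # sum of co-occurrence weights
print(degree)
print(strength)
```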