Data Analysis

Data analyst is an attractive job in the new digital era. The key is to use the right data to perform the right analysis, answer the right questions in the right form, and report to the right people so that they can make the right decisions at the right time and place.

Here is a list of my data analysis projects, in the form of Jupyter Notebooks. All notebooks and datasets are available on GitHub.

HOTEL BOOKING PLATFORMS

Focus on web scraping, exploratory data analysis and user interaction.

Some interesting findings:

The relationship between source and the mean of average price: the hotels recommended by Booking.com have the highest mean of average price.
The relationship between source and the average hotel rating: Hotels.com has the lowest average hotel rating.
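
The two comparisons above are group-by-source averages. A minimal sketch of that computation, using only the Python standard library, is given here. The rows, the "other-site" source and all numbers are made up for illustration and are not the scraped data from the notebook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (source, average_price, rating). All values are made up.
rows = [
    ("booking.com", 180.0, 8.4),
    ("booking.com", 220.0, 8.8),
    ("hotels.com", 150.0, 7.9),
    ("hotels.com", 130.0, 7.5),
    ("other-site", 160.0, 8.1),
]

def mean_by_source(rows, column):
    """Return {source: mean of one column}; column 1 is price, column 2 is rating."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[column])
    return {source: mean(values) for source, values in groups.items()}

mean_price = mean_by_source(rows, 1)
mean_rating = mean_by_source(rows, 2)

# Source with the highest mean price and source with the lowest mean rating.
priciest_source = max(mean_price, key=mean_price.get)
lowest_rated_source = min(mean_rating, key=mean_rating.get)
```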

BANK CLIENTS

Focus on creating a classification model that can predict whether a client of a bank will positively respond to a marketing campaign and invest some amount of money.

STUDENT PERFORMANCE PREDICTION

Focus on two classification models, model performance evaluation, and model optimization.

General solutions:

Descriptive analysis
Exploratory data analysis
Split data into train dataset and test dataset
Train two models (logistic regression and random forest) and evaluate their performance
Optimize Random Forest Model
Determine the most important features (predictors)
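
The performance evaluation step in the list above can be sketched as follows. This is only an illustration: the test labels and the two prediction lists are hypothetical stand-ins for the output of the two trained models, and the choice of metrics (accuracy, precision, recall, F1) is an assumption, not necessarily what the notebook reports.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical test labels and predictions from two models.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 0, 1, 1]
pred_b = [1, 0, 1, 1, 0, 1, 1, 0]

scores_a = evaluate(y_test, pred_a)
scores_b = evaluate(y_test, pred_b)
```

Comparing the two dictionaries side by side shows which model does better on each metric for the same test set.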

NETWORK ANALYSIS OF TWITTER FOLLOWERS

Focus on:

Forming research questions
Scraping data from Twitter using the Twitter API and Tweepy
Descriptive Analysis
Network Analysis
Community Detection
Centrality Measures
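
As a small illustration of the centrality step, the sketch below computes in-degree centrality on a follower network. It assumes the scraped data has already been reduced to (follower, followed) pairs. The user names and edges are hypothetical, and the notebook may well use a graph library instead.

```python
from collections import defaultdict

# Hypothetical (follower, followed) pairs, not real Twitter data.
edges = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("bob", "ann"), ("cat", "ann"), ("dan", "cat"),
]

def in_degree_centrality(edges):
    """Share of the other users that follow each user."""
    nodes = {user for edge in edges for user in edge}
    followers = defaultdict(set)
    for follower, followed in edges:
        followers[followed].add(follower)
    scale = max(len(nodes) - 1, 1)  # avoid division by zero on tiny graphs
    return {node: len(followers[node]) / scale for node in nodes}

centrality = in_degree_centrality(edges)
most_followed = max(centrality, key=centrality.get)
```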

BIKE SHARING PLATFORM

Focus on building a predictive model to estimate demand for bicycles.

General Solution design and implementation:

Exploratory data analysis
(Identify the variables, and understand the relationship between target column and the other variables.)
Develop a regression model to predict the target column.
Conclusions or Decisions.

MARKET BASKET ANALYSIS

Focus on association rule analysis, such as extracting frequent itemsets, creating association rules and making recommendations.
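
To make the terms concrete, here is a minimal sketch of frequent itemsets and association rules on a toy set of baskets. It enumerates itemsets by brute force rather than with Apriori or FP-Growth, handles only rules between single items, and uses made-up transactions and thresholds.

```python
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search over all itemsets up to max_size items."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def association_rules(frequent, min_confidence):
    """Rules antecedent -> consequent between single items, with confidence and lift."""
    rules = []
    for itemset, s in frequent.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            consequent = next(iter(itemset - {antecedent}))
            confidence = s / frequent[frozenset([antecedent])]
            lift = confidence / frequent[frozenset([consequent])]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence, lift))
    return rules

frequent = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.8)
```

A rule such as butter -> bread with high confidence is the kind of result that can be turned into a recommendation.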

ESTIMATE HOUSE SELLING PRICE

Focus on building a predictive model to estimate selling price for houses.

General Solution design and implementation:

Exploratory data analysis: try to understand the different variables in the data. Identify the variables that have an effect on the price of the house.
Develop a decision tree regression model to predict the selling price for new houses on the market.
Try to optimize the model parameters to get the lowest mean squared error (MSE).
Conclusions or Decisions.
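
To show the role MSE plays in a decision tree regression, the sketch below finds the single best split on one variable, which is what a depth-1 tree does. The variable names and values are hypothetical. A full model would repeat this search recursively and over all variables.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted MSE) of the split x <= threshold with the lowest MSE.

    Returns None when all x values are identical and no split is possible.
    """
    best = None
    cuts = sorted(set(xs))
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: living area in square metres and price in thousands.
area = [50, 60, 70, 120, 130, 140]
price = [100, 110, 120, 300, 310, 320]

threshold, split_mse = best_split(area, price)
```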

NEW YORK AIRBNB PRICE PREDICTION

Analyse the dataset from different perspectives, with the main focus on understanding which variables are most important in determining the price of a listing.

Descriptive analysis.
Build a prediction model with random forest regression.
Textual data analysis.
- Generate the 10 most frequent words that appear in the listing column.
- Remove stopwords (English stopwords and other expressions) to get meaningful results.
- Finally, test whether the regression model that I created in the previous step can be improved by including these 10 new columns as predictors.
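
The textual steps above can be sketched as follows. The listing texts and the tiny stopword set are hypothetical, and the tokenisation (lowercase alphabetic words) is an assumption. The resulting 0/1 columns are what would be added as new predictors to the regression model.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real analysis would use a full English list.
STOPWORDS = {"in", "the", "a", "and", "of", "to", "with", "near"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(texts, n=10, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword words across all texts."""
    counts = Counter(
        word for text in texts for word in tokenize(text) if word not in stopwords
    )
    return [word for word, _ in counts.most_common(n)]

def word_indicator_columns(texts, words):
    """One row per text, one 0/1 column per word: does the text contain the word?"""
    return [[int(word in set(tokenize(text))) for word in words] for text in texts]

# Hypothetical listing texts.
listings = [
    "Cozy room in Brooklyn",
    "Sunny room near the park",
    "Cozy studio in Manhattan",
]

words = top_words(listings, n=2)
columns = word_indicator_columns(listings, ["cozy", "room"])
```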

CLASSIFICATION IN HEALTH CARE

Focus on building different classification models to predict whether or not a patient has a heart attack.

General Solution design and implementation:

Try to understand the different variables using some statistical measures and some visualization tools.
Predict the "outcome" column with four classification models.
Optimize the parameters of the models and try to find the model with the best accuracy.
Identify the four most important predictors according to the best decision tree model.
Create a new decision tree model that uses only those four variables as predictors, and try to achieve better accuracy.

MARVEL COMICS CHARACTERS NETWORK

Focus on network analysis of a dataset related to the characters in Marvel Comics.

Calculate different centrality measures for the network (degree centrality, betweenness, closeness, PageRank).
Compare the results.
For each pair of methods, calculate the number of characters that their top 10 lists have in common.
Calculate the correlations between the measures, to understand which methods are most similar to each other, both in terms of the most central characters and in terms of correlation.