
Data Analysis

Data analyst is an attractive job in the new digital era. The key is to use the right data and the right analysis to answer the right questions in the right form, and to report to the right people so they can make the right decisions at the right time and place.

Here is a list of my data analysis projects, in the form of Jupyter Notebooks. All notebooks and datasets are available on GitHub.

Image by Jared Rice

01

Focus on web scraping, exploratory data analysis, and user interaction.

Some interesting findings:

  • Source versus mean of average price: the hotels recommended by booking.com have the highest mean average price.

  • Source versus average hotel rating: Hotels.com has the lowest average hotel rating.
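The two findings above come down to grouping hotels by source and averaging a column. A minimal sketch of that aggregation, using only the standard library, might look like this. The rows, the third source name, and the function `mean_by_source` are made up for illustration and are not the project's data or code.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows: (source, average price, rating).
# These are NOT the scraped data from the project.
hotels = [
    ("booking.com", 180.0, 8.6),
    ("booking.com", 220.0, 8.9),
    ("Hotels.com", 150.0, 7.8),
    ("Hotels.com", 130.0, 8.0),
    ("other-site", 160.0, 8.4),  # hypothetical third source
]

def mean_by_source(rows, column):
    """Group rows by source (index 0) and average the value at `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[column])
    return {source: mean(values) for source, values in groups.items()}

mean_price = mean_by_source(hotels, 1)
mean_rating = mean_by_source(hotels, 2)

# Source with the highest mean price and source with the lowest mean rating.
highest_price_source = max(mean_price, key=mean_price.get)
lowest_rating_source = min(mean_rating, key=mean_rating.get)
```

In a notebook the same step would more likely be a one-line group-by on a data frame; the plain-Python version just makes the logic explicit.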

02

Focus on creating a classification model that predicts whether a bank client will respond positively to a marketing campaign and invest money.

Image by Viktor Forgacs

03

Focus on two classification models, model performance evaluation, and model optimization.

General solutions:

  • Descriptive analysis

  • Exploratory data analysis

  • Split the data into training and test datasets

  • Train two models (logistic regression and random forest) and evaluate their performance

  • Optimize the random forest model

  • Determine the most important features (predictors)
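The split and evaluation steps in the list above can be sketched without any modeling library. The example below is only an illustration under stated assumptions: the function names `train_test_split` and `evaluate`, the 25% test fraction, and the label lists are all hypothetical, and the predictions are hard-coded rather than produced by a logistic regression or random forest.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and use the first part as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical usage: eight row ids, then made-up true and predicted labels.
train, test = train_test_split(list(range(8)))
metrics = evaluate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Fixing the seed keeps the split reproducible, which matters when two models are compared on the same test set.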

04

Focus on:

  • Forming research questions

  • Scraping data from Twitter using the Twitter API and Tweepy

  • Descriptive Analysis

  • Network Analysis

  • Community Detection

  • Centrality Measures
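To illustrate the network analysis and centrality steps, here is a minimal sketch that builds an undirected graph from mention pairs and computes degree centrality by hand. The account handles are placeholders, not scraped data, and the helper names `build_graph` and `degree_centrality` are my own for this example. Community detection is not covered here.

```python
from collections import defaultdict

# Hypothetical (author, mentioned account) pairs; placeholder handles only.
mentions = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "alice"),
]

def build_graph(edges):
    """Build an undirected adjacency map, ignoring self-mentions."""
    graph = defaultdict(set)
    for a, b in edges:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

graph = build_graph(mentions)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
```

A graph library would normally handle this, along with betweenness and community detection; the hand-rolled version only shows what the degree measure means.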



05

Focus on building a predictive model to estimate bicycle demand.

General Solution design and implementation: 

  • Exploratory data analysis
    (Identify the variables, and understand the relationship between target column and the other variables.)

  • Develop a regression model to predict the target column.

  • Conclusions or Decisions.

06

Focus on association rule analysis: extracting frequent itemsets, creating association rules, and generating recommendations.
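As a rough illustration of frequent itemsets and association rules, the sketch below enumerates itemsets by brute force and derives rules with a single-item consequent. It is not the Apriori algorithm with pruning, just the underlying support and confidence arithmetic. The baskets, thresholds, and function names are made up for the example.

```python
from itertools import combinations

# Made-up shopping baskets for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def frequent_itemsets(baskets, min_support, max_size=2):
    """Brute-force enumeration of itemsets up to max_size meeting min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), baskets)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

def association_rules(itemsets, baskets, min_confidence):
    """Rules antecedent -> single item, with confidence = support(both) / support(antecedent)."""
    rules = []
    for itemset, itemset_support in itemsets.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            confidence = itemset_support / support(antecedent, baskets)
            if confidence >= min_confidence:
                rules.append((antecedent, frozenset([item]), confidence))
    return rules

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, transactions, min_confidence=0.7)
```

A recommendation then follows from a rule directly: if a basket contains the antecedent, suggest the consequent.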

Image by Bruno Kelzer

07

Focus on building a predictive model to estimate the selling price of houses.

General Solution design and implementation: 

  • Exploratory data analysis: understand the different variables in the data and identify those that affect the price of the house.

  • Develop a decision tree regression model to predict the selling price for new houses on the market.

  • Tune the model parameters to minimize the mean squared error (MSE).

  • Conclusions or Decisions.

08

Analysing the dataset from different perspectives, with the main focus on understanding which variables matter most in determining the price of a listing.

  • Descriptive analysis.

  • Build a prediction model with random forest regression.

  • Textual data analysis.

    • Generate 10 most frequent words that appear in the listing column.

    • Remove stopwords (English stopwords and other uninformative expressions) to get meaningful results.

    • Finally, test whether the regression model that I created in the previous step can be improved by including these 10 new columns as predictors.
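The textual steps above can be sketched as follows: count words after stopword removal, keep the most frequent ones, and turn each into a 0/1 column that a regression model could use as an extra predictor. The listing titles, the short stopword set, and the function names are hypothetical, and the example keeps only three words instead of ten to stay readable.

```python
import re
from collections import Counter

# A tiny hand-written stopword set; a real analysis would use a fuller list.
STOPWORDS = {"a", "in", "the", "with", "and", "of", "to", "near"}

# Made-up listing titles for illustration only.
listings = [
    "Cozy apartment in the city center",
    "Spacious apartment with a garden",
    "Cozy room near the city park",
]

def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def top_words(texts, n=10):
    """Return the n most frequent non-stopword words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def indicator_columns(texts, words):
    """One row per text, one 0/1 column per word (1 if the word occurs)."""
    rows = []
    for text in texts:
        tokens = set(tokenize(text))
        rows.append([1 if word in tokens else 0 for word in words])
    return rows

words = top_words(listings, n=3)
features = indicator_columns(listings, words)
```

Whether such columns actually improve the model has to be checked by refitting and comparing the error with and without them.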

Image by Joe Taylor

09


Focus on building different classification models to predict whether a patient has a heart attack.

General Solution design and implementation: 

  • Try to understand the different variables using some statistical measures and some visualization tools.

  • Predict the "outcome" column with four classification models.

  • Tune each model's parameters and find the model with the best accuracy.

  • Identify the four most important predictors according to the best decision tree model.

  • Create a new decision tree model that uses only those four variables as predictors, and check whether accuracy improves.

10

Focus on network analysis of a dataset of Marvel Comics characters.

  • Calculate different centrality measures for the network (degree centrality, betweenness, closeness, PageRank).

  • Compare the results.

  • For each pair of methods, count how many characters their top-10 lists have in common.

  • Calculate the correlations between the measures, to understand which methods are most similar to each other, both in their most central characters and in correlation.
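The comparison steps above can be sketched with two small helpers: one counts how many nodes two measures share in their top-k lists, the other computes the Pearson correlation between their scores. The scores and node labels below are invented placeholders, not values from the Marvel network, and k is 3 rather than 10 to keep the example small.

```python
from math import sqrt

# Invented centrality scores for five placeholder nodes.
degree = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3, "E": 0.1}
pagerank = {"A": 0.30, "B": 0.15, "C": 0.25, "D": 0.20, "E": 0.10}

def top_k(scores, k):
    """Set of the k nodes with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def common_top_k(scores_a, scores_b, k):
    """Number of nodes that appear in the top k of both measures."""
    return len(top_k(scores_a, k) & top_k(scores_b, k))

def pearson(scores_a, scores_b):
    """Pearson correlation over the nodes present in both score maps."""
    nodes = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[n] for n in nodes]
    ys = [scores_b[n] for n in nodes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

overlap = common_top_k(degree, pagerank, k=3)
correlation = pearson(degree, pagerank)
```

The two views can disagree: measures may correlate well overall while still ranking different nodes at the very top, which is why both are worth computing.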

Image by Keren Fedida

11

Focus on textual data analysis of a dataset containing 337 articles published on Medium on the topics of ML, AI, and data science.

General Solution design and implementation: 

  • Data Preprocessing

  • Update the stopword list and remove stopwords.

  • Count the most frequent words.

  • Topic Modeling

12

Focus on performing K-Means clustering on a patient dataset and determining the optimal number of clusters (elbow method).
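To show what the elbow method looks at, here is a minimal one-dimensional K-Means sketch that reports the inertia (within-cluster sum of squared distances) for several values of k. Everything here is an assumption for illustration: the values are made up rather than patient data, the data is 1-D, and the centroids are initialized deterministically from evenly spaced positions in the sorted data instead of randomly.

```python
# Made-up 1-D measurements forming three obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

def kmeans_1d(data_points, k, max_iter=100):
    """Lloyd's algorithm in one dimension; returns (centroids, inertia)."""
    data = sorted(data_points)
    n = len(data)
    # Deterministic start: evenly spaced positions in the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: abs(x - centroids[c])) for x in data
        ]
        if new_assignment == assignment:
            break  # converged: assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    inertia = sum((x - centroids[a]) ** 2 for x, a in zip(data, assignment))
    return centroids, inertia

# Inertia for k = 1..4; the "elbow" is where the drop levels off.
inertias = {k: kmeans_1d(values, k)[1] for k in range(1, 5)}
```

On this toy data the inertia falls sharply up to k = 3 and barely changes at k = 4, which is the pattern the elbow method reads as "three clusters".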
