Data Mining

With Python

January 30, 2020
Salahaddin University-Erbil
Software Engineering Dept.
MSc
2020

General Information

  • University: Salahaddin University-Erbil
  • Department: Software Engineering Dept.
  • My Status: Lecturer
  • Level: MSc
  • Year: 2020

Course Description

Data mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, coding, the application of statistical and machine learning techniques, and visualization of the generated structures. The course covers all of these steps and illustrates the whole process with examples. Special emphasis will be given to machine learning methods, as they provide the real knowledge discovery tools. Important related technologies, such as data warehousing and on-line analytical processing (OLAP), will also be discussed. Students will use recent data mining software. Enrollment in this course is limited to 15 students.

Prerequisites

  • Statistics and Probability
  • Linear Algebra
  • Programming Fundamentals (Python preferred)
  • Database Systems (or equivalent)
  • Machine Learning basics (recommended)

Course Objectives

Upon completion of this course, students will be able to:

  • Understand the fundamental concepts and principles of data mining and knowledge discovery.
  • Apply appropriate data preprocessing techniques for different types of data.
  • Implement and evaluate various data mining algorithms using Python.
  • Analyze and interpret data mining results effectively.
  • Design and execute complete data mining projects from data collection to result interpretation.
  • Compare and select appropriate algorithms for different data mining tasks.
  • Understand the ethical considerations and limitations of data mining applications.

Course Outline

Week 1: Introduction to Data Mining and Knowledge Discovery

  • Definition and scope of data mining
  • Knowledge Discovery in Databases (KDD) process
  • Data mining applications and challenges
  • Data mining vs. traditional statistical analysis
  • Ethical considerations in data mining
  • Lab: Setting up Python environment and data mining tools

Week 2: Data Preprocessing and Quality

  • Data cleaning and transformation techniques
  • Handling missing values and outliers
  • Data normalization and standardization
  • Feature selection and dimensionality reduction
  • Lab: Data preprocessing with Python libraries
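
As a rough illustration of the Week 2 topics, the sketch below fills missing values with the mean, rescales to [0, 1] (min-max normalization), and converts to z-scores (standardization). It uses only the Python standard library; the function names and the `ages` sample are made up for this example, and the lab itself may use pandas or scikit-learn instead.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample with one missing entry.
ages = [20, None, 30, 40]
filled = impute_mean(ages)
normalized = min_max_normalize(filled)
z_scores = standardize(filled)
```

Mean imputation is only one of several options; it is chosen here because it is the simplest to read.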

Week 3: Exploratory Data Analysis

  • Descriptive statistics and data visualization
  • Correlation analysis and feature relationships
  • Data distribution analysis
  • Outlier detection methods
  • Lab: EDA using pandas, matplotlib, and seaborn
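
Two of the Week 3 ideas, correlation analysis and outlier detection, can be sketched without any third-party library. The example below computes the Pearson correlation coefficient by hand and flags outliers with the common 1.5 × IQR rule. The helper names and sample data are assumptions for illustration only; in the lab, pandas and seaborn would normally do this work.

```python
from math import sqrt
from statistics import mean, quantiles


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data: a perfectly linear pair and a sample with one extreme value.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
outliers = iqr_outliers([10, 12, 11, 13, 12, 100, 11, 12])
```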

Week 4: Association Rule Mining

  • Market basket analysis concepts
  • Apriori algorithm and its variants
  • Support, confidence, and lift measures
  • Frequent itemset mining algorithms
  • Lab: Association rule mining implementation
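
The three rule measures listed for Week 4 can be made concrete with a small example. For a rule A → B, support is the fraction of transactions containing A ∪ B, confidence is support(A ∪ B) / support(A), and lift is confidence divided by support(B). The sketch below assumes a tiny made-up transaction list; the function names are illustrative and are not taken from any particular library.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
l = lift({"bread"}, {"milk"}, transactions)
```

A lift below 1, as in this toy data, suggests the two items co-occur less often than they would if they were independent.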

Week 5: Classification Algorithms - Part 1

  • Decision tree algorithms (ID3, C4.5, CART)
  • Naive Bayes classification
  • Classification performance evaluation metrics
  • Model validation techniques
  • Lab: Decision tree and Naive Bayes implementation
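
The evaluation metrics mentioned for Week 5 all derive from the confusion matrix of a binary classifier. The following is a minimal sketch in plain Python; the label vectors are invented for illustration, and the lab may rely on scikit-learn's metrics module rather than hand-written helpers like these.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
```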

Week 6: Classification Algorithms - Part 2

  • Support Vector Machines (SVM)
  • K-Nearest Neighbors (KNN)
  • Classification algorithm comparison
  • Hyperparameter tuning
  • Lab: SVM and KNN implementation with scikit-learn
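
To show the idea behind K-Nearest Neighbors before moving to scikit-learn, here is a small from-scratch sketch: the query point takes the majority label of its k closest training points under Euclidean distance. The data, labels, and the `knn_predict` name are assumptions made for this example, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training set with two well-separated classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction_a = knn_predict(points, labels, (2.0, 2.0))
prediction_b = knn_predict(points, labels, (6.0, 7.0))
```

The value of k is a hyperparameter; k=3 is used here only as a convenient default.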

Week 7: Midterm Exam and Review

  • Midterm Exam: Covers weeks 1-6 material
  • Review of classification concepts
  • Model evaluation practice
  • Lab: Exam review and model comparison

Week 8: Clustering Analysis

  • K-means clustering algorithm
  • Hierarchical clustering methods
  • Density-based clustering (DBSCAN)
  • Cluster validation and evaluation techniques
  • Lab: Clustering algorithms implementation
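
The K-means algorithm alternates two steps until the centroids stop moving: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below is a minimal plain-Python version with fixed starting centroids so that the result is reproducible; the sample points and the `k_means` signature are assumptions for illustration, not the lab's reference implementation.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Cluster points with Lloyd's algorithm, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two obvious groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [points[0], points[3]])
```

In practice the starting centroids are usually chosen at random or with a seeding scheme such as k-means++, and the result can depend on that choice.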

Week 9: Advanced Classification Methods

  • Ensemble methods (Bagging, Boosting, Random Forest)
  • Neural networks for classification
  • Multi-class classification strategies
  • Imbalanced data handling techniques
  • Lab: Ensemble methods and neural networks

Week 10: Regression and Prediction

  • Linear and multiple regression
  • Logistic regression for classification
  • Time series analysis and forecasting
  • Regression tree methods
  • Lab: Regression analysis and prediction models
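
For simple linear regression with one predictor, the least-squares line y = slope · x + intercept has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies that formula to a small invented data set; the function name is hypothetical, and multiple regression in the lab would more likely use NumPy or scikit-learn.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear_regression(xs, ys)
prediction = slope * 6 + intercept
```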

Week 11: Dimensionality Reduction and Feature Engineering

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Feature extraction and construction
  • Feature selection algorithms
  • Lab: Dimensionality reduction techniques

Week 12: Advanced Topics - Part 1

  • Text mining and natural language processing
  • Web mining and social network analysis
  • Big data mining challenges and solutions
  • Lab: Text mining and web scraping

Week 13: Advanced Topics - Part 2

  • Deep learning approaches for data mining
  • Real-world data mining applications
  • Current trends and future directions
  • Lab: Deep learning for data mining

Week 14: Final Exam Preparation and Review

  • Comprehensive review of all course material
  • Practice problems and sample questions
  • Final Exam: Theoretical component
  • Lab: Final exam practice and preparation

Week 15: Final Project and Course Wrap-up

  • Final Exam: Practical data mining project
  • Course evaluation and feedback
  • Future learning paths and advanced topics
  • Lab: Final project presentation and evaluation

Textbooks

  • [Recommended] “Data Mining: Concepts and Techniques” by Jiawei Han, Micheline Kamber, and Jian Pei
  • [Optional] “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

Assessment

  • Assignments and Projects (30%)
  • Data Mining Project (25%)
  • Midterm Exam (20%)
  • Final Exam (25%)