Did you know that most hiring processes in big companies are supported by machine learning algorithms that pre-select the best resumes before handing them to a human? (1) That some banks use machine learning to predict whether or not you will be able to repay a loan? (2) That most online customer support services start the interaction with an intelligent agent rather than a human? (3) Because of the vast number of applications we are confronted with in our daily lives, a basic knowledge of data science is essential to understand how decisions are made and what we, as members of society, can do to maximize the efficiency and minimize the risks of these processes.
That is why public awareness of computer science is of great importance, and science enthusiasts should not be the only ones informed about its challenges and pitfalls. Education should offer learning opportunities to a broad audience by creating resources that can be understood with little to no scientific background. With that in mind, we want to design an online course about machine learning that is accessible to everybody and offers different levels of difficulty.
The course targets everyone, with or without prior knowledge of computer science, and implements two levels of difficulty: “beginner”, requiring neither a background in machine learning nor coding skills, and “advanced”, for learners with basic coding skills. Both levels explore the problems of data quality in machine learning in three steps: assess the quality of the data, clean the data, and observe the influence of the cleaning on the performance of the model.
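The three steps above can be sketched on a toy example. This is only an illustrative sketch, not material taken from the course: the data, the helper names (`fit_line`, `mean_abs_error`, `outlier_flags`) and the choice of a straight-line model with a median-absolute-deviation outlier rule are all assumptions made for the example.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

def outlier_flags(values, k=3.0):
    """Flag values further than k median absolute deviations from the median.

    Applying the rule to the raw target values is a simplification that
    only suits a tiny toy dataset like this one.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    return [abs(v - centre) > k * mad for v in values]

# Hypothetical training data following y = 2x, with one corrupted value.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2, 4, 6, 8, 10, 120]
# Clean held-out data used to measure model performance.
test_x = [7, 8]
test_y = [14, 16]

# Step 1: assess the quality of the data.
flags = outlier_flags(train_y)

# Step 2: clean the data by dropping the flagged points.
clean_x = [x for x, bad in zip(train_x, flags) if not bad]
clean_y = [y for y, bad in zip(train_y, flags) if not bad]

# Step 3: observe the influence of the cleaning on model performance.
mae_raw = mean_abs_error(fit_line(train_x, train_y), test_x, test_y)
mae_clean = mean_abs_error(fit_line(clean_x, clean_y), test_x, test_y)
print(f"error before cleaning: {mae_raw:.2f}")
print(f"error after cleaning:  {mae_clean:.2f}")
```

In this toy setting a single corrupted value is enough to pull the fitted line far from the true trend, and removing it restores the fit. Real datasets rarely behave this neatly, which is why the assessment step comes first.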
The course takes the form of a book, with a menu on the left giving access to any of its pages.
It comprises three chapters, each exploring data quality for one type of data: the first chapter deals with numerical data, the second with image data, and the third with text data.
An introduction presents the background needed to understand the concepts of the course material.
Each chapter contains subsections, and each page of the book ends with a quiz that tests the newly acquired knowledge.
Each subsection contains a summary video that wraps up everything covered in the section.
At the end of each chapter, a practical task is available for learners of the advanced content, and a general review quiz is available to everyone.
The appendix contains additional, non-essential information for curious learners. Its pages describe the transformations applied to the raw data to prepare it for the course, as well as the algorithms used.
The learners understand how to measure and assess errors in a dataset, such as noise and missing or corrupted data.
The participants learn techniques for data preparation and data cleaning, to increase the quality of a dataset for its further use in a machine learning experiment.
Participants with a basic knowledge of programming write code and learn methods from various Python libraries. Participants without prerequisites can visualize the data and the effect of each change directly on interactive graphs.
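To give an idea of what measuring and repairing missing data can look like in code, here is a minimal sketch using only the Python standard library. The helper names (`is_missing`, `missing_rate`, `impute_median`), the sample values and the choice of median imputation are assumptions made for illustration; they are not taken from the course, which may rely on other libraries and techniques.

```python
import math
from statistics import median

def is_missing(value):
    """Treat None and NaN as missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_rate(values):
    """Fraction of entries in the column that are missing."""
    return sum(1 for v in values if is_missing(v)) / len(values)

def impute_median(values):
    """Replace each missing entry with the median of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]

# Hypothetical column of temperature readings with two gaps.
temperatures = [21.0, None, 19.5, float("nan"), 22.5]

rate = missing_rate(temperatures)      # assess: how much data is missing?
cleaned = impute_median(temperatures)  # clean: fill the gaps

print(f"missing rate: {rate:.0%}")
print(cleaned)
```

Median imputation is only one possible choice. Dropping incomplete rows or filling with the mean are common alternatives, and which one is appropriate depends on how much data is missing and why.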
The entire book can be followed in a web browser.