A brief exploration of Brazilian Higher Education Data using Python

Parts 1 and 2: Data Extraction and Cleaning

Thiago Cardoso
5 min read · Sep 6, 2020

Introduction

This study organizes and develops a basic analysis of the higher education data available for Brazilian public and private institutions. This is my first data science/exploration project using Python, so the analysis and code showcased here are far from being a repository of best practices. Any suggestions to improve this project are appreciated.

For better organization and comprehension, this study is divided into 3 parts:

Part 1 — Data Extraction: Description of the datasets used in the project, their sources, and how to download them. This post will be updated if future analysis demands new datasets.

Part 2 — Data Cleaning: Code used to clean data and standardize column names. The main goal is to have comparable panel data, with annual information for courses, students, and institutions in the last decade.

Part 3 — Data Analysis: The analysis is subdivided into 6 sections. The first section, available here, analyzes the historical panorama of higher education growth in Brazil, especially in the last 10 years.

Jupyter notebooks with the Python code for each part are available in this GitHub repository. Much of the description written in this post is also in the notebooks.

Some important considerations:

  • English is not my native language. Sorry for the mistakes;
  • Many (maybe most) lines of code lack consistency, performance and/or efficiency. I did my best to balance productivity and code quality. Any suggestions to improve the code are welcome;
  • The Analysis and Data Cleaning only scratch the surface of the extremely rich data used in this study. Any author seeking to take this study further should feel free to contact me. I can also help with any translation issues and point to additional sources of data in Brazil.

OK. Let's get started.

Library Import

To extract and organize the files, the following libraries were imported:

import os
import shutil
import time
import webbrowser
from os import path

The Data

Brazilian Higher Education Census — 1995 to 2018

Data from Brazilian public and private higher education institutions (HEI) are collected and organized by the “National Institute for Educational Studies and Research Anísio Teixeira (INEP)”. Since 2009, INEP has published the Higher Education Census microdata at the student, teacher, course and institution levels. Here is a brief description of the Higher Education Census provided by INEP:

Annually, this initiative collects data on higher education in the country, including graduate and sequential courses – both attendance and distance learning courses, providing a "radiography" of this educational level. Higher education institutions fill out the census online forms and, based on this data, provide policy makers with an overview of educational policy trends. By collecting data regarding the number of enrollment and graduates, candidates to university entrance; information on faculty – by qualifications and contract nature – as well as on administrative and support staff; financial data and infrastructure, this initiative provides valuable information about an educational level that is perceived to be in a process of expanding and diversifying.

From 1995 to 2008, INEP published data only at the course and institution levels. In this study, we focus on extracting, cleaning, and analyzing data from 2009–2018. Further work needs to be done to organize the 1995 to 2008 data.

The code below downloads all Higher Education Census files, from 1995 to 2018, from the INEP website, one at a time.
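As a rough illustration of a "one at a time" download loop, here is a minimal sketch built on the webbrowser and time modules imported above. It is not the notebook's code: the URL template is a placeholder (the real INEP address is not reproduced here), the use of webbrowser for downloading is an assumption based on the import list, and build_urls and open_one_at_a_time are hypothetical helper names.

```python
import time
import webbrowser

# Placeholder template: NOT the real INEP address. Replace it with the
# actual download link pattern before running.
URL_TEMPLATE = "https://example.org/microdata/higher_education_census_{year}.zip"


def build_urls(first_year, last_year, template=URL_TEMPLATE):
    """Return one download URL per year, in chronological order."""
    return [template.format(year=year) for year in range(first_year, last_year + 1)]


def open_one_at_a_time(urls, opener=webbrowser.open, pause_seconds=5.0):
    """Hand each URL to the browser, pausing between requests.

    The pause gives each download time to start before the next one is
    requested. `opener` is injectable so the loop can be dry-run.
    """
    opened = []
    for url in urls:
        opener(url)
        opened.append(url)
        time.sleep(pause_seconds)
    return opened


if __name__ == "__main__":
    census_urls = build_urls(1995, 2018)
    # Dry run: print the URLs instead of opening 24 browser tabs.
    open_one_at_a_time(census_urls, opener=print, pause_seconds=0)
```

Passing the opener as a parameter keeps the loop easy to check without touching the network. Dropping the `opener=print` argument would send each link to the default browser instead.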

National Assessment of Student Achievement (ENADE) — 2004 to 2018

ENADE is an external evaluation applied annually to assess undergraduate students' learning in their final year. Each course is assessed every three years: programs are grouped into three representative areas, and one group is assessed each year.

ENADE microdata provide de-identified information at the student level: each student's answers and results in the content assessment, as well as their answers to a questionnaire on socioeconomic background and perceived course quality.

I have not worked on the ENADE data yet. However, I have already extracted it for future analysis. The code below downloads all ENADE files, from 2004 to 2018, from the INEP website, one at a time.

Preliminary Course Concept (CPC)

CPC is an index, ranging from 1 to 5, calculated by INEP to assess undergraduate course quality. In this work, only 2009 and 2018 CPC data is used.

The 2018 CPC is calculated according to the following aspects and weights:

  • 20% — Students' performance in ENADE
  • 35% — Difference between expected and observed performance in ENADE (expected performance is calculated from the average performance in courses whose students have a similar background)
  • 7.5% — Proportion of teachers with an MA degree
  • 15% — Proportion of teachers with a Ph.D. degree
  • 7.5% — Teachers' work regime
  • 15% — Students' perception regarding infrastructure, pedagogical organization and opportunities to improve academic and professional formation.
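To make the weighting concrete, the list above can be read as a weighted sum of six component scores. The sketch below only illustrates that arithmetic. It assumes all components are already expressed on the same scale, the key names and the example course are invented, and INEP's official methodology has further steps that this sketch does not attempt to reproduce.

```python
# Weights as listed above for the 2018 CPC (fractions of the final score).
# The key names are hypothetical labels chosen for this example.
CPC_2018_WEIGHTS = {
    "enade_performance": 0.20,
    "expected_vs_observed_enade": 0.35,
    "teachers_with_ma": 0.075,
    "teachers_with_phd": 0.15,
    "teachers_work_regime": 0.075,
    "perception_of_course_conditions": 0.15,
}


def weighted_score(components, weights=CPC_2018_WEIGHTS):
    """Combine component scores (all on the same scale) into one score."""
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Invented component scores for a fictional course.
example_course = {
    "enade_performance": 4.0,
    "expected_vs_observed_enade": 2.0,
    "teachers_with_ma": 3.0,
    "teachers_with_phd": 1.0,
    "teachers_work_regime": 5.0,
    "perception_of_course_conditions": 3.0,
}

if __name__ == "__main__":
    print(round(weighted_score(example_course), 3))
```

Because the weights add up to 100%, a course scoring the same value on every component gets exactly that value as its combined score.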

The 2009 CPC has slight variations in some weights.

CPC data from 2010 to 2017 can be found here.

Further information on the CPC can be found here (in Portuguese).

General Course Index (IGC)

The IGC provides a quality index for higher education institutions based on the average CPC of the last three years, the distribution of students between undergrad and graduate levels, and the average evaluation of graduate courses.

Further information on the IGC can be found here (in Portuguese).

Because IGC file names differ a lot, I wrote the laziest code possible to download all files.

Data Organization

Folder paths inside the zip files, file names, and file formats vary hugely across years. Therefore, the files had to be organized partly by hand.

In this manual process of data organization, the following file names were chosen ({year} stands for the year of the data):

  • For Higher Education Census (HEC) data:
    student-level files: alunos_{year};
    course-level files: cursos_{year};
    university-level files: ies_{year}
  • For IGC data: igc_{year}
  • For CPC data: cpc_{year}

The code below automates the first steps of this process for the HEC files. Namely:

  • i) create a data/csv_bases folder in the project folder;
  • ii) standardize HEC file names;
  • iii) transfer downloaded HEC files to data/csv_bases in the project folder;
  • iv) extract files from .zip
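The four steps above can be sketched with the os and shutil modules imported earlier. This is a minimal sketch under stated assumptions, not the notebook's code: organize_zip is a hypothetical helper, and the hec_{year} archive name is an invented convention used only for this example.

```python
import os
import shutil
from os import path


def organize_zip(downloaded_zip, year, project_dir, prefix="hec"):
    """Move one downloaded .zip into data/csv_bases and extract it there.

    Returns the folder the archive was extracted into.
    """
    # i) create data/csv_bases inside the project folder (no error if present)
    target_dir = path.join(project_dir, "data", "csv_bases")
    os.makedirs(target_dir, exist_ok=True)

    # ii) + iii) give the archive a standardized name while moving it
    new_zip = path.join(target_dir, f"{prefix}_{year}.zip")
    shutil.move(downloaded_zip, new_zip)

    # iv) extract the archive into a folder named after it
    extract_dir = path.join(target_dir, f"{prefix}_{year}")
    shutil.unpack_archive(new_zip, extract_dir)
    return extract_dir
```

Calling organize_zip once per downloaded year leaves one extracted folder per year under data/csv_bases. Renaming the files inside each folder to the names listed earlier would still be a separate, largely manual step, since the internal layout differs from year to year.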

Data Cleaning

All the code used to clean the HEC, IGC and CPC datasets and to standardize column names and values, so that the yearly files can be appended into a single .csv, is available in this Jupyter notebook.

I still need to improve the code quality in this part a lot. Unfortunately, INEP changes column names almost every year, which demands a lot of time reading the data dictionaries.
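One common way to cope with column names that change between years is a per-year rename mapping applied before the files are appended. The sketch below shows the idea using only the standard library. It is not the notebook's code: the raw column names are illustrative placeholders that were not checked against the INEP data dictionaries, the "|" delimiter is an assumption, and read_standardized and append_years are hypothetical helpers.

```python
import csv
import io

# Hypothetical mapping from each year's raw column names to standard ones.
# The raw names are placeholders, not checked against the INEP dictionaries.
RENAME_BY_YEAR = {
    2009: {"CODIGO_CURSO": "course_id", "N_MATRICULAS": "enrolled"},
    2010: {"CURSO_COD": "course_id", "MATRICULAS_TOTAL": "enrolled"},
}


def read_standardized(csv_file, year, rename_by_year=RENAME_BY_YEAR, delimiter="|"):
    """Yield rows of one yearly file with standardized column names plus a year."""
    rename = rename_by_year[year]
    for raw_row in csv.DictReader(csv_file, delimiter=delimiter):
        # Keep only the mapped columns, under their standardized names.
        row = {rename[old]: value for old, value in raw_row.items() if old in rename}
        row["year"] = year
        yield row


def append_years(files_by_year, out_file, columns=("year", "course_id", "enrolled")):
    """Write all yearly files into a single CSV with one shared header."""
    writer = csv.DictWriter(out_file, fieldnames=list(columns))
    writer.writeheader()
    for year in sorted(files_by_year):
        for row in read_standardized(files_by_year[year], year):
            writer.writerow(row)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for two yearly files with different headers.
    files = {
        2009: io.StringIO("CODIGO_CURSO|N_MATRICULAS\n1|100\n"),
        2010: io.StringIO("CURSO_COD|MATRICULAS_TOTAL\n1|120\n"),
    }
    combined = io.StringIO()
    append_years(files, combined)
    print(combined.getvalue())
```

Keeping the mapping in one dictionary means that each new year only requires a new entry, at the cost of still having to read that year's dictionary to fill it in.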

The final result of all the cleaning is the following set of .csv files:

  • cursos.csv: the number of enrolled students, freshmen, graduates, seats, and applicants per course from 2009 to 2018. This data frame also contains course characteristics such as administrative category (private vs. public), modality (online vs. in-class), location and course field.
  • ies.csv: total costs and revenues per institution from 2010 to 2018 (the 2009 file does not have budget information).
  • cpc.csv: CPC index for all evaluated courses in 2009 and 2018.
  • igc.csv: IGC index for all evaluated institutions from 2009 to 2018.

I also worked on cleaning the students' data frames from the HEC. However, I still need to work on the description and organization of that code.

Finally, the first section of data analysis is available in this next post.
