Integrating Ethics in a Data Science Project:
10 Questions a Data Science Project Team Should Ask
While the potential ethics issues that might arise when using data science and artificial intelligence have certainly been in the popular press recently, there has not been as much discussion of how a data science team should incorporate ethics within a project. Furthermore, despite this recent publicity, a recent survey of data scientists that I conducted found that many data scientists still do not focus on the ethical conundrums that they might encounter.
In short, to act responsibly, data scientists must integrate ethics into their projects. For example, a data science project team may be asked to develop a model to predict the healthcare costs of prospective employees by tracking and analyzing the eating habits and exercise routines of those potential employees. To properly address this type of project, the data science team, including its manager, needs to understand a range of underlying ethical issues such as fairness (what training data should be used to ensure there is no gender bias in the machine learning system used to rank job applications?) and privacy (is it okay to data mine “public” social media data to train models that infer personal attributes and habits?). Given this example, the data science team would need to work together to approach these potential dilemmas thoughtfully.
To help ensure ethics is considered during a data science project, below are 10 questions a data science manager should ask so that potential ethics conundrums are identified and discussed.
Q1: Which laws and regulations might be applicable to our project?
It is important to consider which laws and regulations might be relevant, what these laws are designed to protect or accomplish, and what the impact may be of not taking them into account. This includes considering recent regulations such as GDPR (the European General Data Protection Regulation).
Q2: How are we achieving ethical accountability?
It should be clear who is accountable for minimizing the harm that could be done by the project. Accountability includes ensuring that the project team proactively identifies potential stakeholders and evaluates harms, such as possible disproportionate effects that may arise from the application of a model.
Q3: How might the legal rights of an individual be impinged by our use of data?
For the project to be ethical, the organization must have the right to use the data for its specific purpose. For example, privacy issues should focus not only on who owns the collected data, but also on the rights that need to be applied to downstream users of that data.
Q4: How might individuals’ privacy and anonymity be impinged via our aggregation and linking of data?
While the need for anonymity is not new to the computing field, the thought process with respect to how to ensure anonymity must be re-examined with the emergence of advanced data science linking techniques. In short, consideration should be given to how privacy will be maintained through the transmission, storage and merging of the data.
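One way to make the linking concern concrete is to count how many records share the same combination of identifying fields before and after data sets are merged. The sketch below is only an illustration: the records, the field names (`zip`, `birth_year`, `gender`) and the helper `smallest_group_size` are hypothetical, and the check is a simple k-anonymity-style count, not a complete privacy assessment.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values for the given fields (the 'k' in k-anonymity). Returns 0 if
    there are no records."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values(), default=0)

# Hypothetical records, as they might look after linking two data sets.
linked = [
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1985, "gender": "F"},
    {"zip": "02139", "birth_year": 1990, "gender": "M"},
    {"zip": "02140", "birth_year": 1990, "gender": "M"},
    {"zip": "02140", "birth_year": 1990, "gender": "M"},
]

# With only the zip code, every person shares their values with someone else.
k_zip_only = smallest_group_size(linked, ["zip"])

# With the linked fields combined, at least one person becomes unique.
k_all_fields = smallest_group_size(linked, ["zip", "birth_year", "gender"])

print(k_zip_only, k_all_fields)
```

A smallest group size of 1 means at least one individual is uniquely identifiable from the combined fields, even if no single field identifies them on its own.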
Q5: How do we know that the data is ethically available for its intended use?
Being able to access and collect data does not mean that it is ethical to use that data. Hence, care must be taken to understand who owns the data, what their rights and expectations are, and whether the data is being used in the way that the person (or entity) that contributed the data intended.
Q6: How do we know that the data is valid for its intended use?
A data science project should ensure that the data it uses is suitable for its intended use within the project. One aspect of data validity is data accuracy. For example, imputing missing values or excluding records with missing values could have a significant impact on the downstream analytical results (and might amplify bias). Another data validity concern is ‘fitness for purpose’, that is, whether specific data is appropriate for the way it will be used.
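The effect of the missing-value choice can be seen on a very small, made-up data set. The sketch below assumes hypothetical groups "A" and "B", where values are missing only for group B, and compares excluding incomplete records with filling gaps using the overall observed mean. Neither approach is presented as the right one; the point is that each distorts the picture of group B in a different way.

```python
from statistics import mean

# Hypothetical (group, value) rows; None marks a missing value.
rows = [
    ("A", 10.0), ("A", 12.0), ("A", 14.0),
    ("B", 30.0), ("B", None), ("B", None),
]

def group_mean(data, group):
    """Mean of the values belonging to one group."""
    return mean(value for g, value in data if g == group)

# Option 1: exclude records with missing values.
dropped = [(g, v) for g, v in rows if v is not None]

# Option 2: impute missing values with the overall observed mean.
overall_mean = mean(v for _, v in dropped)
imputed = [(g, overall_mean if v is None else v) for g, v in rows]

# Dropping shrinks group B from half of the data to a quarter of it.
share_b_dropped = sum(1 for g, _ in dropped if g == "B") / len(dropped)

# Imputing pulls group B's mean toward the mean of the other group.
mean_b_dropped = group_mean(dropped, "B")
mean_b_imputed = group_mean(imputed, "B")

print(share_b_dropped, mean_b_dropped, mean_b_imputed)
```

In this toy case, excluding records leaves group B under-represented, while imputing the overall mean makes group B look more like group A than the observed data suggests.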
Q7: How have we identified and minimized any bias in the data or in the model?
Machine learning models can be built using data that has a bias, and thus the model might also learn this bias (for example, machine learning algorithms have been shown to inherit racial and gender biases from their training data). In other words, bias in a model might come from bias in the data used to build it. Thus, the data science team needs to be aware that its choices with respect to training data might have profound impacts on others.
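A first, rough check for this kind of bias is to compare how often a model selects members of each group. The sketch below uses made-up predictions and hypothetical group labels (`group_a`, `group_b`); it computes per-group selection rates and the ratio of the lowest rate to the highest. A commonly cited rule of thumb treats a ratio below 0.8 as a signal worth investigating, but a single ratio is only one of many possible fairness measures and does not by itself show that a model is fair or unfair.

```python
from collections import defaultdict

def selection_rates(outcomes):
    """Given (group, was_selected) pairs, return the fraction of each
    group that was selected."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in outcomes:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest selection rate."""
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any selected members")
    return min(rates.values()) / highest

# Hypothetical model outputs: 6 of 10 selected in one group, 3 of 10 in the other.
predictions = (
    [("group_a", True)] * 6 + [("group_a", False)] * 4
    + [("group_b", True)] * 3 + [("group_b", False)] * 7
)

rates = selection_rates(predictions)
ratio = disparate_impact_ratio(rates)
print(rates, ratio)
```

Here the second group is selected half as often as the first, which is the kind of gap the team would want to trace back to the training data before the model is used.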
Q8: How was any potential modeler bias identified, and if appropriate, mitigated?
There can be subjectivity within the model-building process: model building involves subjective decisions, and these decisions can introduce biases and prejudices. For example, the team must decide which metric to optimize, which algorithm to use, which data sources to use, and whether one data point should be used as a proxy for a missing fact.
Q9: How transparent does the model need to be and how is that transparency achieved?
An explanation in understandable terms as to why a specific decision is recommended often cannot be supplied, even by the team that built the model. This makes explainability and comprehensibility very difficult, and many models are effectively a black box. Model transparency is particularly important when model output might disadvantage a certain subgroup (or appear to disadvantage a specific subgroup), or in situations where there is a high degree of regulation or a right of challenge (e.g., lending money).
Q10: What are likely misinterpretations of the results and what can be done to prevent those misinterpretations?
Most predictive models are statistical in nature. They provide no guarantees; rather, they tell us about areas where an increased probability of an outcome might guide us to act differently. With this in mind, the data science project manager should ensure that the analytical decisions made as a result of a data science project reflect the scale, accuracy and precision of the data that was used in creating the model.
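One way to keep the scale of the data in view is to report a range instead of a single number. The sketch below is an illustration only, assuming a hypothetical model that was correct in 80% of cases on a small and on a large evaluation set. It uses the Wilson score interval for a proportion, with z = 1.96 for an approximate 95% level; other interval methods could be used instead.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion.
    z = 1.96 corresponds to an approximate 95% confidence level."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width

# Same observed rate (80%), very different amounts of evidence.
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)

print("n=10:   %.2f to %.2f" % small)
print("n=1000: %.2f to %.2f" % large)
```

The same headline figure supports a far narrower range of conclusions when it rests on 1000 cases than when it rests on 10, which is the kind of context that helps prevent results from being over-interpreted.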