Near the beginning of the semester (Aug 31), I attended a talk at the iSchool by Dr. Jingrui He about a machine learning solution her team crafted to address the issue of data heterogeneity.

Dr. Jingrui He’s faculty homepage

As I have only a basic understanding of machine learning, many of the technical ideas flew right over my head (although I do have to admit that she did an excellent job of explaining, in a way I could follow, the basic concept of how her algorithm worked). However, during her talk my brain wandered in a different direction. I may be trying my best to learn as many technical skills as I can during my master's, but at my core I am a humanist. And so as she was talking about the good her algorithm could do across various industries with heterogeneous data, I was thinking about the ethics behind acquiring that data.

What is data heterogeneity?

Think about the difference between the biological/chemical terms "homogeneous" and "heterogeneous." "Homo" means same, so "homogeneous" means having a uniform makeup or composition. "Hetero" is the opposite and means different, so "heterogeneous" means having a non-uniform composition. Heterogeneity simply refers to the quality of being diverse. So data heterogeneity, also known as data variety, refers to data that come from many different sources and exist in many different forms. Variety is one of the three core "V"s of Big Data, alongside Volume and Velocity.

What is big data?

Let’s look at an example: When analyzing someone’s health, their medical record is probably a good place to start. But you may also want to look at their activity on health-centered social networking sites, their search history on health-centered websites, and their purchase history at pharmacies. This would give you a much fuller picture of their health. But it also raises an important question - one that was not discussed during her talk.
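As a quick aside before that question: here is a minimal sketch, in Python, of what combining those sources might look like once you actually have them. Every source, field name, and value below is hypothetical - the point is only that each source arrives with its own shape, format, and update frequency.

```python
# A minimal sketch of "heterogeneous" health data pulled into one view.
# Every source, field name, and value here is hypothetical.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PatientProfile:
    """One person's health picture, stitched together from very different sources."""
    patient_id: str
    diagnoses: List[str] = field(default_factory=list)        # structured: medical record
    forum_posts: List[str] = field(default_factory=list)      # unstructured: social-network text
    search_queries: List[str] = field(default_factory=list)   # semi-structured: site search logs
    pharmacy_purchases: List[Dict] = field(default_factory=list)  # transactional: purchase history

def merge_sources(patient_id: str, ehr: Dict, posts: List[str],
                  searches: List[str], purchases: List[Dict]) -> PatientProfile:
    """Combine four differently shaped sources into a single profile.
    Each arrives with its own schema and format - that mismatch is what
    "data heterogeneity" (data variety) refers to."""
    return PatientProfile(
        patient_id=patient_id,
        diagnoses=ehr.get("diagnoses", []),
        forum_posts=posts,
        search_queries=searches,
        pharmacy_purchases=purchases,
    )

profile = merge_sources(
    "patient-001",
    ehr={"diagnoses": ["type 2 diabetes"]},
    posts=["Anyone else struggle with late-night snacking?"],
    searches=["metformin side effects"],
    purchases=[{"item": "glucose test strips", "date": "2018-04-02"}],
)
print(profile.diagnoses, profile.search_queries)
```

Notice that the sketch simply assumes all four sources are already sitting in front of you - which is precisely the question: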

How do you acquire all of this data from disparate and (supposedly) protected sources in order to analyze it?

The Issue of Data Privacy

Right now, there are a number of major news stories in the public consciousness that concern data privacy. The most notable at the moment is the scandal involving Facebook and Cambridge Analytica, but the issue has been under discussion ever since Edward Snowden leaked classified information about the NSA's surveillance programs in 2013.

Facebook data privacy scandal: A cheat sheet

Everything you need to know about PRISM

Data and Goliath (book review)

This is an issue that spans both the public and private realms, with abuses by governments and corporations alike. There are no heroes in this story - or are there?

Librarians are your Data Privacy Heroes

What is the one group of people who will fight tooth and nail to provide you with all of the information and knowledge you want while simultaneously defending your personal information from anybody who comes knocking? Librarians.

“We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.”
ALA (American Library Association) Code of Ethics

ALA on Privacy

Librarians may technically be (local) government employees, but if the police or the FBI asks a librarian to hand over the reading history of a suspect, the librarian will respond, "What reading history?" Libraries will deliberately purge their records to protect the privacy of their patrons.

You are not what you read

But what does this have to do with the data privacy scandals of the past few years? Simply this: librarians have an important role to play in the age of big data.

Libraries are community centers, and beyond that they are a source of continuous, life-long learning. Information literacy is becoming as important for adults as reading literacy is for children.

“Research has shown that people need education in developing skills that will help them use the Internet effectively. Libraries can serve as primary training providers to help meet this need.”
ALA (American Library Association)

ALA on Information Literacy/Digital Citizenship

Librarians are among the few people we trust to protect our data and privacy simply because it is the right thing to do. And I hope that in the public debate surrounding data privacy, librarians will step forward and act as advocates and educators for their communities - informing people about the problem, researching solutions, and advocating for those solutions at the highest levels of government and policy-making.

Keeping this key role of librarians in mind, let’s move on to the second part of this article.

Data Ownership

Let’s circle back to the initial thought that inspired this article. How do you get access to all of this heterogeneous data in order to analyze it?

Corporate Ownership of Data

In the current system, corporations own most of the data about you. Have you ever heard the saying, "If you're not paying for the product, you're the product"? Think about how many services you use today that are free: Facebook, Twitter, and pretty much any social network… the whole suite of Google products (Search, Gmail, Calendar, Drive, YouTube, etc.)… most online services are offered to users free of charge. Think about all of those free apps on your phone.

“If you are not paying for it, you’re not the customer; you’re the product being sold.”
Andrew Lewis, MetaFilter

Now think: out of all of those services, how many times did you actually read the Terms of Service before checking "Agree"? Do you know exactly what you are agreeing to?

When you use these products, you are creating data for them. They can use the data you create or enter (name, birthday, etc.) to target you with advertising.

How Google and Facebook Have Taken Over the Digital Ad Industry

Not only can they use the data you created on their product for themselves - they can also sell that data to other companies. So who owns the data? You may have created it, but they own it - and they reap the profits.

Now you may think, so what? I get to use these products for free, and I can just ignore the advertising if I want to (or use an adblocker). Why does it matter that they get the data? It’s not hurting anybody, and companies need to turn a profit in order to create and keep jobs. Well…

Automation, AI, & Data

The whole world is currently experiencing an automation revolution. Manufacturing jobs aren’t disappearing because they are all being shipped overseas or being stolen by immigrants - most are going to the machines.

“From 2000 to 2010 alone, 5.6 million jobs disappeared. Interestingly, though, only 13 percent of those jobs were lost due to international trade. The vast remainder, 85 percent of job losses, stemmed from “productivity growth” — another way of saying machines replacing human workers.”
Industrial robots will replace manufacturing jobs - and that’s a good thing

The next major industry to be disrupted will likely be transportation - automated cars & trucks are just around the corner. This may be a scary thought - jobs disappearing! - but it shouldn't be. When robots and AI systems can take over the boring, monotonous, or dangerous jobs, it frees people up to do more creative things - jobs robots can't perform (yet!) - and lets us utilize our collective brainpower more effectively.

Automation Could Kill 73 Million US Jobs by 2030

The painful part will be the transition. People will need to be re-trained for jobs in different sectors, which will require a massive rethinking of adult education. There will be long stretches of time during which people are unemployed. Their new jobs may not pay as well, or as steadily. They may need to rely on savings while re-training for their new jobs - and most Americans don’t have enough savings to lean on for more than a month.

The Dangerous State of Americans’ Savings

Automation will bring great things for the future, but it's going to get worse before it gets better. And what technology does automation rely on? AI and machine learning. And what do AI and machine learning rely on? Data. Also lots of talented programmers, but their algorithms wouldn't have anything to run on without data. And where does that data come from? Us. Our data, owned by the corporations, isn't just good for targeted advertising - it's also essential for developing AI.

Google sells the future, powered by your personal data

Major corporations are using the data we produce with their (free and paid) products to develop the technology that will massively disrupt labor in a way we haven't seen since the Industrial Revolution. And again, this isn't necessarily a bad thing! But it is dangerous, especially when an outsized portion of the profits stands to go to the (already stupid rich) corporations, and not to the huge masses of data-producing people who made the AI possible (not to mention the programmers who created the AI and ML systems but have no ownership of the end product).

A Solution: Personal Ownership of Data

What if we owned our data? Imagine a world in which you receive no spam calls - because those spammers don’t have access to your telephone number… since you never directly gave it to them (although you could choose to sell them your number). Is a system like this even possible? On the one hand, having greater control over our personal data would solve a lot of the issues with the current system (in which corporations use your data for their profit). But on the other hand, data at its core is information - and is it really possible to own information? You can own a book, but does that mean you own the information in that book? Ownership implies exclusivity, and information by its very nature defies exclusivity.
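Still, as a thought experiment, here is a tiny sketch of what a consent-based "personal data vault" might look like: nothing leaves it without an explicit, revocable grant from the owner. Nothing here describes a real system - the class, the fields, and the grant model are all invented for this illustration.

```python
# A thought-experiment sketch of "personal ownership of data": a vault where
# nothing is released without an explicit, revocable grant from the owner.
# Everything here (the class, fields, and grant model) is hypothetical.

class PersonalDataVault:
    def __init__(self, owner: str):
        self.owner = owner
        self._data = {}       # e.g. {"phone": "..."}
        self._grants = set()  # (requester, field) pairs the owner has approved

    def store(self, field: str, value: str) -> None:
        self._data[field] = value

    def grant(self, requester: str, field: str) -> None:
        """Owner explicitly allows one requester to read one field."""
        self._grants.add((requester, field))

    def revoke(self, requester: str, field: str) -> None:
        self._grants.discard((requester, field))

    def read(self, requester: str, field: str):
        """A requester who was never granted access simply gets nothing."""
        if (requester, field) in self._grants:
            return self._data.get(field)
        return None

vault = PersonalDataVault("me")
vault.store("phone", "555-0100")
vault.grant("my_doctor", "phone")
print(vault.read("my_doctor", "phone"))     # 555-0100
print(vault.read("telemarketer", "phone"))  # None - they never got access
```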

The reality is, our current system is not set up for personal ownership of our data. But that doesn't mean future societies and technologies couldn't be set up that way. What happens when devices are no longer worn on our bodies but become part of them? When the chain of information is recorded every step of the way? When our personal identity becomes synonymous with a fingerprint, or a microchip, or a system of nanobots? How will society change to reflect new ethics surrounding data ownership? Right now, pondering these questions is the domain of science fiction - but they may become very real, very present questions within our lifetime.

So personal data ownership is something that we should consider, even if the current technological infrastructure is headed in the opposite direction. The most ethical path is rarely the easiest… or the most profitable.

Another Solution: Universal Basic Income

Silicon Valley is not ignorant of the issues I've raised in this article. One popular solution that it (along with multiple governments around the world) has decided to test is Universal Basic Income (UBI).

The Paradox of Universal Basic Income

It's a controversial idea that is not yet well understood - which is why the numerous UBI experiments going on around the world are so important. Y Combinator (a startup accelerator) is funding one project, and so are local governments in multiple countries. UBI is a complex topic that I don't have space to fully address here, but if you are interested in learning more I highly recommend Rutger Bregman's book "Utopia for Realists" (you can find my thoughts on the Books page!).

The main concern about UBI is how to pay for it - and one popular suggestion is to let the robots pay for it. In essence, tax the companies that use automated systems instead of people (and thus take jobs away from people) to fund citizens' UBI. There are, of course, obvious problems with this - how do you measure the potential job loss and set the tax amount? What if it discourages automation? How do you prevent it from disproportionately affecting small businesses (since large corporations often find loopholes to escape their tax responsibilities)? But the general sentiment - that the rich corporations developing the AI that will disrupt labor should bear some of the cost of that disruption (if only to prevent an economic meltdown) - speaks to a powerful utopian idea.
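As for the first of those problems - measuring the job loss and the tax amount - here is a deliberately naive, back-of-the-envelope sketch of how such a levy might be computed. The formula, the 30% rate, and the very notion of counting "displaced jobs" are assumptions invented for this illustration, not a real policy proposal.

```python
# A deliberately naive "robot tax" sketch - not a real policy formula.
# The levy rate, salary figure, and the idea of counting "displaced jobs"
# are all assumptions invented for this illustration.

def annual_automation_levy(jobs_displaced: int,
                           avg_replaced_salary: float,
                           levy_rate: float = 0.30) -> float:
    """Charge a company a fraction of the payroll it no longer pays out
    because automated systems replaced human workers."""
    return jobs_displaced * avg_replaced_salary * levy_rate

# A firm that automates away 1,000 jobs paying $40,000/year, at a 30% levy,
# would owe $12 million per year toward a UBI fund.
print(annual_automation_levy(1_000, 40_000.0))  # 12000000.0
```

Even this toy version shows where the arguments would start: every single input is contestable.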

“The idea of guaranteeing a basic income for everybody has many obvious flaws but one overwhelming virtue. It enshrines the principle that every citizen is a valued member of society and has a right to share in its collective wealth.”
John Thornhill, Financial Times

Why Facebook Should Pay Us a Basic Income

In Conclusion

As you can see, most of this messy trail of thoughts had nothing to do with Dr. He's talk on addressing data heterogeneity. But if there is one important point I want to leave you with, it's this: every technical solution to a problem carries ethical implications that you must consider as well. The engineer's work should never be divorced from the humanist's, but rather tightly married to it. I hope that as Dr. He's work progresses, she has members on her team who are considering these issues - not just how to acquire the heterogeneous data, but the most ethical way to do so.

Contemplating the societal effects of new technologies is intensely interesting to me - and one reason why I love science fiction so much. It’s also likely to be a recurring theme in future blog posts… so if you like debating futuristic ethics, stay tuned!