The dangers of #bigdata and #junkdata. Similar problems with #AI, too. #weaponsofmathdestruction

rajeev2007

Jul 10, 2017

this book review was published in Swarajya magazine, March 2017. it is not online, so here's my actual copy.

lately, i've been seeing a lot of people swear by 'data'. this is another shibboleth with terrible implications. the west has the vanity that by reducing everything to data they can arrive at the truth. that is not true. data becomes information only when contextual information is supplied.

besides, there are problems with 'fat tailed distributions'. we assume, implicitly, that the phenomena we study fall under the gaussian bell curve, or normal distribution. if they don't, and are fat tailed, then what we think are unlikely events will happen far more frequently than we thought: thus black swan events.
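to put rough numbers on that, here is a minimal sketch (an illustration, not something from the book) comparing how often a value lands more than k units above the centre under a standard normal curve versus a standard cauchy distribution, a textbook fat-tailed case. the choice of cauchy and the function names are assumptions made for the example; real-world data will follow neither curve exactly, and the cauchy "unit" is a scale parameter, not a standard deviation.

```python
import math


def normal_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable (thin tails)."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def cauchy_tail(k: float) -> float:
    """P(X > k) for a standard Cauchy variable (fat tails)."""
    return 0.5 - math.atan(k) / math.pi


if __name__ == "__main__":
    # Compare the two tail probabilities at increasing distances from the centre.
    for k in (2, 5, 10):
        n, c = normal_tail(k), cauchy_tail(k)
        print(f"k={k:>2}: normal={n:.3e}  cauchy={c:.3e}  ratio={c / n:.1e}")
```

the gap widens quickly as k grows: a model that assumes the bell curve treats as practically impossible an event that the fat-tailed curve says will turn up fairly regularly.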

the other huge problem is the unconscious assumption that 'correlation = causation'.
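a toy simulation shows how this goes wrong. in the sketch below (all names and numbers are hypothetical, chosen only for illustration), a hidden factor drives both a 'proxy' and an 'outcome'. the two end up strongly correlated, yet when the proxy is set independently of the hidden factor, the correlation vanishes, because the proxy never caused the outcome in the first place.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=10_000, seed=0):
    """Return (observed correlation, correlation after forcing the proxy)."""
    rng = random.Random(seed)
    # Hypothetical unobserved common cause of both measured quantities.
    hidden = [rng.gauss(0, 1) for _ in range(n)]
    proxy = [h + rng.gauss(0, 0.5) for h in hidden]
    outcome = [h + rng.gauss(0, 0.5) for h in hidden]
    # "Intervention": the proxy is assigned at random, ignoring the hidden cause.
    forced_proxy = [rng.gauss(0, 1) for _ in range(n)]
    return pearson(proxy, outcome), pearson(forced_proxy, outcome)


if __name__ == "__main__":
    observed, forced = simulate()
    print(f"observed correlation: {observed:.2f}")
    print(f"after forcing proxy:  {forced:.2f}")
```

an algorithm trained only on the observed data would happily score people on the proxy, even though changing the proxy would change nothing about the outcome.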

we mess up on all these fronts.

what i call 'junk data' is data with incorrect assumptions that has been fed into computers, which will produce circular reasoning that 'proves' the assumptions correct. there are also heavily biased data sources that ignore inconvenient data points.

briefly, excessive dependence on the infallibility of computers is a bad idea.

i wrote a companion piece to this in swarajya, on the ethical problems of bias in data selection for AI. https://swarajyamag.com/magazine/fatal-flaw-in-ai-the-robots-will-probably-be-as-biased-as-their-masters

The Danger from our Over-Reliance on Computers and Big Data

Rajeev Srinivasan (Book Review)

Can we trust computers? The evidence is mixed. Those in the business have seen enough ‘kludges’ and bugs that they would, if they were honest, be suspicious of a lot of things spewed out by computers. But do the infernal machines work, more or less, most of the time? They in fact do, and they do useful things, too. It is now possible to mathematically prove the correctness of at least the core (or kernel) of some operating systems, but for the most part we have to take things on faith.

The average person on the street, however, is often misled into thinking that anything that comes out of a computer must be true, because after all, it has the weight of all those white-coated types chanting mysterious incantations or whatever that you see in the movies. So if the computer tells you your credit score is a tad low at 500, you take it in all humility and internalize the idea that you are a bit of a deadbeat who can’t get loans.

Not to be a neo-Luddite, but our excessive dependence on computers is a bit worrisome. Our faculties as a species may be getting eroded. For instance, all of us used to do arithmetic in our heads, until calculators came around. We used to navigate by ourselves, until Google Maps appeared. There was a controversy over a 2008 article in the US monthly The Atlantic, “Is Google Making Us Stupid?”, because now we don’t need to know anything, as we can look it up.

Unfortunately, this is coming precisely as computers are getting smarter. All those chess, Go, and poker players who have been bested by computers can tell you that. In fact, we now have to worry about the ethics of artificial intelligence and self-driving cars, as I mention in a companion piece in this issue. At least we think that’s in the future, but this book, Weapons of Math Destruction (Allen Lane/Penguin Random House, 2016) by Cathy O’Neil, a PhD mathematician-turned-data-scientist, suggests we are already feeling some of the deadly effects of Big Data.

A part of the problem comes from the confusion of statistical correlation with causation. The computer models make assumptions – for example that a broken family is associated with increased tendency towards violent crime – which are buried in the bowels of the algorithm, are opaque, and cannot be questioned by their victims. These proxies may not have a causal relationship with the outcome, but their use is widespread. A recent study by Daniel Hamermesh et al. suggests there is a correlation in the US workforce between race and laziness, but it will undoubtedly be taken to imply causation: that blacks and Hispanics are inherently lazy precisely because they are blacks and Hispanics.

O’Neil’s villains are the big algorithms that increasingly run our lives. In a dystopian vision, she produces example after example of big pieces of software that have become a sort of Deep State, one whose workings are incomprehensible except to the code-jockey boffins who run them; and said boffins often have no idea of the devastation they can wreak on individuals and society.

We are generally familiar with the trading software that has caused ‘flash crashes’ and the algorithms that led to the sub-prime lending debacle in the US. But O’Neil (who had a ring-side view of the market meltdown as a quant jock at hedge fund D E Shaw) points out that there are several others that have equally sinister outcomes, often because there is a self-fulfilling prophecy – people who are deemed undesirable by algorithms in fact become undesirable as a result.

O’Neil starts with a rating scheme for convicts called LSI (Level of Service Inventory); unfortunately, says she, the parameters used to rate them, and the questions in the questionnaire, are biased against the poor and especially against blacks. Thus questions about unemployment and about criminal convictions among friends and family seem reasonable enough, but the answers end up giving poor and black convicts longer terms and, likely, greater difficulty in finding a job upon release. This creates a vicious cycle.

Then O’Neil goes on to several other case studies, all of which may seem innocuous enough to begin with. But as we begin to succumb to total dependence on the ratings spewed out by algorithms, the connection with reality begins to recede. The software guys doing the coding may have no idea whether the assumptions they are making are appropriate. And the field guys who do know that stuff will soon be defeated by the complexity of the algorithms.

One result is the Black Swan effect that Nassim Nicholas Taleb wrote about so evocatively. Events with very low, but non-zero, probability will soon be excluded from the calculations, with the result that when such events occur (as they did in the 2008 meltdown) the entire edifice on which the algorithms rest will crumble catastrophically.

O’Neil talks of “haphazard data gathering and spurious correlations, reinforced by institutional inequities, and polluted by confirmation bias”. In addition, she wonders if “we’ve eliminated human bias or merely camouflaged it with technology”. She talks of “pernicious feedback loops” that lead to “toxic cycles”, and she concludes that these are the mathematical analogs of Weapons of Mass Destruction, hence the title of the book.

As examples, O’Neil offers up several algorithms. One is used to rate schoolteachers, and it seems to grossly distort incentives, pushing teachers to focus largely, or entirely, on test scores, thus devaluing various other things a good teacher can offer, such as inspiring students, or taking time with a slow starter.

Another is the pernicious role played by a US News and World Report ranking of colleges, which has outlived the magazine itself. Apparently an objective measurement of the ‘quality’ of the college, this metric has now become so widely adopted that colleges focus exclusively on the fifteen parameters it considers. So much so that they went on a spending spree, building stadiums, grand campuses, attracting star football players, and so on. But college fees were not part of the metric; and these soared, as well.

Today, toxic student loans are a huge overhang on the US economy, as big as the bad mortgage problem. In addition, entrepreneurs created rip-off private for-profit colleges which deliberately targeted poor and military-veteran students, as well as non-whites. Again, clever advertising techniques using Big Data allowed these colleges to sell essentially useless but expensive, loan-led ‘education’ to these people.

O’Neil suggests that greed is a major factor. The folks over on Wall Street who have been making up ever cleverer mathematical models to make money often don’t realize that the money comes from screwing over real people, as was the case with sub-prime mortgages and the related credit-default swaps and Collateralized Debt Obligations. Or even if they do, they don’t care. She points out that despite major convulsions, in large part the big Wall Street firms and banks and hedge funds did all right, often at taxpayer expense.

O’Neil goes on to give a litany of other examples of malevolent data exploitation, for instance in hiring, loan processing, worker evaluation, voter targeting and even health monitoring. It’s chilling to think of how these will play out when applied to the relatively trusting and naïve populations of rural India. These algorithms, which are “opaque, unregulated and incontestable”, can be truly weapons of mass destruction. They take your privacy and individualism away from you, and whatever they decide about you, you have no appeal. Truly a frightening Big Brother scenario.

1250 words, 10 February 2017

Shadow Warrior
