The world is more interconnected and more digitally recorded than ever before, and one consequence of this technological shift is that scientists have an unprecedented amount of data to work with. But this has also given rise to new questions, not just about how to use that data, but how to use it optimally and responsibly.
These questions were explored in the symposium “Data Analytics, Social Media and Sustainability”, supported by the Elsevier Foundation at the 28th TWAS General Meeting in Trieste, Italy. The symposium brought together a trio of experts in sustainable development, machine learning and cybersecurity to discuss the promise, potential and challenges presented by the big data era.
The Elsevier Foundation’s director, Ylann Schemm, said the symposium was designed to be a “deep dive” into the importance of harnessing the data revolution for good causes, in particular those pursued by the U.N.’s Sustainable Development Goals (SDGs).
“So many of you are involved in sustainability-oriented research,” Schemm said to the assembled scientists at the meeting, “but we wanted to have a space where we could zoom out and look at some of the larger issues in data analytics.”
Elsevier began expanding into sustainable development to support efforts around the launch of the United Nations’ SDGs in 2015. One of several initiatives, The Elsevier Foundation-TWAS Sustainability Visiting Expert Programme, for example, helped researchers from the North work with experts in developing countries to improve their work related to sustainability. And the power of data analytics puts information in the hands of decision-makers so that they can support research in developing countries that will help address those ambitious goals.
The global mission of the SDGs
Maria de Kleijn, senior vice president of analytical services for Elsevier, who spearheaded the 2018 report “Artificial Intelligence: How knowledge is created, transferred, and used”, described the goals as missions, each with an intimidatingly large number of targets – for example, providing affordable health care to everyone on Earth. Her division investigates analytics, collaboration and gender in research, and her talk focused on a key issue: how to ensure the right big data research is being done to advance the SDGs.
There is a chain of steps connecting the goals to research, de Kleijn explained. First researchers must have an idea, then find people who will help them, then find money, then a facility, then do the research, analyse the results, write them up, submit the paper, publish it – and then, finally, an impact is possible. Organizing all these steps around such broad targets is a challenge. But, de Kleijn said, big data can help meet that challenge.
“Why is big data even possible?” she said. “It starts with digital communication. A lot of communication we do is no longer analog, it’s suddenly digital. So instead of me needing to write in my journal where I’ve been on holiday, I turn on ‘my location’ on my mobile.”
Digital communication has completely transformed how research is done – how it gathers data and produces findings, and then how it is made available for further analysis. Machines, she said, can find tiny aberrations in huge amounts of data collected from our communications systems, and produce graphics to help human beings understand and interpret that data.
For disaster research, for example, an algorithm can identify almost 30,000 articles across all journals over a five-year period and sort them by disaster type. This data can then be contrasted with information about the human and economic toll of disasters.
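As a purely illustrative sketch of what “sorting articles by disaster type” can mean in practice, the snippet below matches invented keywords against invented titles; the keyword lists, titles and categories are all assumptions made for the example, and real bibliometric pipelines over databases such as Scopus are far more sophisticated.

```python
# Toy classifier: assign each article title to a disaster type by keyword.
# Keywords, categories and titles are invented for illustration only.
DISASTER_KEYWORDS = {
    "earthquake": ["earthquake", "seismic"],
    "flood": ["flood", "inundation"],
    "drought": ["drought"],
}

def classify(title: str) -> str:
    t = title.lower()
    for disaster, words in DISASTER_KEYWORDS.items():
        if any(w in t for w in words):
            return disaster
    return "other"

titles = [
    "Seismic retrofitting of school buildings",
    "Flood risk mapping in coastal deltas",
    "Crop failure under prolonged drought",
]
for title in titles:
    print(classify(title), "-", title)
```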
“If you look at the countries that do the (disaster) research, Japan is one country with both high research and high toll because of the earthquakes,” de Kleijn said. “High research impact is in Europe where there’s almost no toll. But countries doing the disaster research, do that research on disaster types that matter to themselves. If you have disasters that are not present in Japan, China, U.S. or Europe, then they are under-researched.”
Where the research is done matters, and local information can be critical. When the Ebola epidemic hit West Africa, the response was managed by the U.S. Centers for Disease Control and Prevention in Atlanta, Georgia, where researchers focused on where the disease was spreading. But the crucial realization for stopping the disease came on the ground from Africans, who recognized that it was spreading through local burial rituals in which family and friends would literally hug the dead body and become infected themselves. African medical journals noted that 70% of deaths in Guinea were linked to burial rituals. But in the U.S. the effect of rituals was hardly acknowledged at all.
“That was a major cause why the disease kept on spreading,” she said. “It was working with religious leaders and local health professionals together designing a different way to be doing the burials that was respectful to people’s need to say goodbye. You could not figure this out from Atlanta. You needed to be on the ground to figure that out.”
Machine learning: a major innovation driver
It’s not just big data, but also the method of machine learning that is at the forefront of innovation today, said computer scientist V.S. Subrahmanian of Dartmouth College in the U.S. Machine learning is a category of artificial intelligence that automates data analysis. Programmes that use the method learn from the data they analyse to identify patterns and shape models with little human help.
Subrahmanian, a world-renowned expert in artificial intelligence and cybersecurity, said the goal of machine learning is to use large tables of data to calculate probabilities. For example, if a company wants to hire someone, it can take tables full of data and account for variables such as education and desired salary, then deduce which candidate is most likely to be a good, successful hire. It can also divide candidates into groups using the same variables.
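As a rough illustration of this kind of table-driven prediction, a hypothetical table of past candidates can be fed to off-the-shelf models. The data, the two features and the choice of models below are assumptions made purely for the example, not Subrahmanian’s method.

```python
# A minimal sketch of table-driven prediction and grouping.
# The candidate table, features and labels are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Each row is a past candidate: [years of education, desired salary in kUSD]
X = np.array([
    [16, 60], [18, 90], [12, 45], [20, 120],
    [16, 75], [14, 50], [18, 110], [12, 40],
])
y = np.array([1, 1, 0, 1, 1, 0, 0, 0])  # 1 = turned out to be a successful hire

# Estimate the probability that a new candidate will be a successful hire.
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[17, 70]])[0, 1])

# The same variables can also divide candidates into groups (clusters).
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))
```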
This is useful for the science of global health, making machine learning important for achieving Sustainable Development Goal 3. For example, it was key to predicting the spread of Ebola in West Africa. Subrahmanian’s team wanted to predict mortality rates on a week-by-week basis. Because the sponsor needed results urgently, there wasn’t enough time for traditional data-collection methods. So they drew on local social media.
"We gathered months of Twitter data from countries, primarily in English and French, and said, 'Can we extract certain variables from Ebola-related tweets?'” Subrahmanian explained. “Such as the intensity of fear in each of those countries – and anxiety, anger, depression and stress.”
From this information they were able to predict mortality and morbidity rates from Ebola during the next three weeks.
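The sketch below illustrates the general shape of that idea under loudly invented assumptions: a toy keyword lexicon stands in for the team’s emotion-intensity measures, fabricated weekly numbers stand in for the real data, and a one-variable regression stands in for their model. None of it reflects the team’s actual pipeline.

```python
# Hedged sketch: score tweets for fear intensity with a tiny keyword
# lexicon, aggregate weekly, and regress deaths three weeks ahead on the
# score. Lexicon, data and model are illustrative assumptions only.
import numpy as np

FEAR_WORDS = {"afraid", "scared", "fear", "peur"}  # invented English/French lexicon

def fear_score(tweet: str) -> float:
    """Fraction of words in the tweet that belong to the fear lexicon."""
    words = tweet.lower().split()
    return sum(w in FEAR_WORDS for w in words) / max(len(words), 1)

# Hypothetical weekly aggregates: mean fear score, deaths three weeks later.
weekly_fear = np.array([0.01, 0.03, 0.08, 0.12, 0.10, 0.15])
deaths_3wk = np.array([5, 9, 30, 55, 48, 70])

# Simple least-squares fit: deaths ~ a * fear + b
a, b = np.polyfit(weekly_fear, deaths_3wk, 1)
print(f"predicted deaths in 3 weeks at fear=0.2: {a * 0.2 + b:.0f}")
```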
Machine learning has even been used to predict the likelihood of violence by the Boko Haram militant group. A database covering nine years of Boko Haram activity was built from open sources, and from this data it was possible to learn predictive rules.
“We were able to come up with predictive rules,” Subrahmanian said. “If there is a month in which there are no reports of foreign recruitment by Boko Haram and no reports of arrests of Boko Haram fighters, then the probability of sexual violence attacks 5 months later is 86.4%.”
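A rule like this can be read as a conditional probability computed over a monthly event table. The sketch below shows that computation on a small invented table; the real nine-year dataset is not reproduced here, and the actual rule-learning machinery is far richer than this single lookup.

```python
# Minimal sketch of checking a temporal probabilistic rule over a
# hypothetical monthly event table (all values invented for illustration).
import pandas as pd

# One row per month: did open sources report each event type that month?
df = pd.DataFrame({
    "foreign_recruitment": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "fighter_arrests":     [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    "sexual_violence":     [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
})

LAG = 5  # months between the condition and the predicted attacks
condition = (df["foreign_recruitment"] == 0) & (df["fighter_arrests"] == 0)
outcome = df["sexual_violence"].shift(-LAG)  # what happens LAG months later

mask = condition & outcome.notna()
prob = outcome[mask].mean()
print(f"P(sexual violence attacks {LAG} months later | condition) = {prob:.1%}")
```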
A useful tool, a personal risk
Tshilidzi Marwala, an engineer and the vice chancellor of the University of Johannesburg in South Africa, has long experience training young scientists in the global South. He provided a “developing world perspective” on artificial intelligence and machine learning.
In South Africa, he said, machine learning has been useful for HIV epidemiology. The country’s national department of health provides HIV tests in various circumstances: hospitals require them from pregnant women, for instance, and a bank will even ask for an HIV test before deciding whether to grant a mortgage. So Marwala and his colleagues used the department of health’s data to build models that could estimate HIV prevalence in the population. This proved useful for researchers, even though some data, such as the test-takers’ income, was missing.
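One common way to build a model despite gaps like the missing income data is to impute the missing values before fitting. The sketch below shows that pattern with invented data and variables; it is a generic illustration of imputation, not the actual models Marwala’s group built.

```python
# Hedged sketch: impute missing fields, then fit a model on the completed
# table. Data, variables and labels are invented for illustration only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Columns: [age, income (often missing), prior test taken]; np.nan marks gaps.
X = np.array([
    [25, 3000.0, 0], [31, np.nan, 1], [42, 5200.0, 0],
    [29, np.nan, 0], [35, 4100.0, 1], [50, np.nan, 1],
])
y = np.array([0, 1, 0, 0, 1, 1])  # hypothetical test outcomes

# Fill gaps with the column mean, then fit a simple classifier.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X, y)

# Predictions still work when the income field is missing.
print(model.predict_proba([[33, np.nan, 1]])[0, 1])
```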
Questions remain about big topics that could someday transform everything from medicine to the understanding of political behavior. For example: can the vast amounts of data now available be used with machine learning to help predict when two countries will go to war? Variables scientists have considered include, for example, the distance between capital cities.
Even so, the process presents complications. Statistical models are approximations of reality that will never match it exactly, he explained, especially since the data scientists receive are always incomplete and imperfect. Furthermore, the question of who owns data complicates its use in big, sweeping, multinational projects.
“I think we need to produce algorithms that work with limited and imperfect information, and secondly we need to think about regulation of ownership and data,” Marwala said. “And thirdly, we need to create infrastructure to create our own data. You can’t do it alone. We have to do it with our neighbours, and get enough data in order to solve our problems.”
Developing countries are still struggling with cloud computing because they often lack the hardware and bandwidth to support it, Marwala noted. Data can also be misused, he said, and that makes ownership of personal data – and the privacy of that data – a sensitive issue.
“People choose to exchange their personal data,” Marwala pointed out. “If you are on Gmail, you are basically entering a contractual agreement. The deeper question is: How does that affect you? If you are on Twitter, you are accepting that Twitter account in exchange for your private data – you are entering a contractual relationship. Moving forward, we will have to learn what data privacy means.”
Sean Treacy