SIL 802, Winter 2017: Data Science for Development

As always, the world seems to be on the brink of disaster. Each day we read about pollution, climate change, corruption, inequality, exploitation... The world today is, however, far more instrumented than it ever was. Data from IoT sensors, satellite imagery, frequent surveys and censuses, market linkages through communication tools, social media, government MIS, etc. can possibly all be pooled together to learn about problems before it is too late, build models that could suggest solutions, and understand what is happening in the world a bit better. We will look at various opportunities and challenges in data collection, curation, and analysis, in several different application areas. Students will work in groups and implement large data collection and analysis systems, or write term papers on specific topics.

The course will be structured along three tracks that will run pretty much in parallel:

  • Thematic background required to be able to ask the right questions. This will include readings on poverty, development, inequality, industry structure, agriculture value chains, corruption, etc.
  • Data analysis methods. We will do a crash course starting from probability distributions, and go on to study the basics of hypothesis testing, linear and nonlinear regression, ANOVA, and time series analysis with AR/MA models. We will then move on to statistical machine learning methods (SVMs, decision trees, KNN, Bayes classifiers), unsupervised techniques such as clustering, PCA, and ICA, and spatiotemporal data analysis.
  • Applications. We will look at a wide range of topics such as using mobile phone call records to estimate poverty and infer migration patterns, satellite data to map land use, social media data to understand unemployment issues, social network analysis of corporate and political networks, and much more.

All readings for the course have been uploaded to a Dropbox folder; ask me for a link. Note that we won't study all the papers, but we will try to cover all three aspects listed above from across the readings. Several project ideas are interspersed among the readings below, and some more are listed towards the end.


  • Chapters from Jean Dreze and Amartya Sen, An Uncertain Glory: India and Its Contradictions
  • Chapters from Daryl Collins, et al, Portfolios of the Poor: How the World's Poor Live on $2 a Day
  • Chapters from Amartya Sen, Development as Freedom

Poverty and inequality

Theory
  • Chapters from Thomas Piketty, Capital in the Twenty First Century
  • Wolfgang Streeck, How Will Capitalism End?
Data from mobile phones
  • Joshua Blumenstock..., Predicting Poverty and Wealth from Mobile Phone Data
  • Mini-project idea: Make a super strong presentation on what we can do with mobile CDR data, and we will take it to the Indian telecom providers. The Indian telcos have been very conservative in giving access to CDR records, but there are many examples from other parts of the world.
Data from satellites
  • Michael Xie..., Transfer Learning from Deep Features for Remote Sensing and Poverty Mapping
  • Yong Suk Lee..., International Isolation and Regional Inequality: Evidence from Sanctions on North Korea
Mapping sociological patterns at smaller scales
  • David Peters, Income Inequality across Micro and Meso Geographic Scales in the Midwestern United States
  • Joshua Blumenstock, Inferring Patterns of Internal Migration from Mobile Phone Call Records: Evidence from Rwanda
More applications
  • Kush Varshney..., Targeting Villages for Rural Development Using Satellite Image Analysis
  • Brian Dillon, Using Mobile Phones to Collect Panel Data in Developing Countries
  • Project idea: Use similar approaches to analyze Google Earth Engine data and see if the satellite data based analysis coincides with unit level census surveys. Census surveys are done every 10 years or so in India, but with satellite data we can then do much higher frequency assessments of poverty changes. It will be very interesting to also understand spatial correlation in poverty changes at different levels - villages/blocks/districts, and look at the spatial distribution of inequality at the village level.
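
One common way to quantify the spatial correlation mentioned in the project idea above is global Moran's I. The sketch below is only an illustration, assuming binary adjacency weights between neighbouring units; the village names, values, and the `morans_i` helper are hypothetical and do not come from any of the readings.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary adjacency weights.

    values:    dict mapping unit name -> observed value (e.g. change in a poverty index)
    neighbors: dict mapping unit name -> set of adjacent unit names (assumed symmetric)
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {unit: v - mean for unit, v in values.items()}

    # Total weight W: every (unit, neighbour) pair counts once with weight 1.
    total_weight = sum(len(adj) for adj in neighbors.values())
    # Sum of products of deviations over all neighbouring pairs.
    cross = sum(dev[u] * dev[v] for u, adj in neighbors.items() for v in adj)
    sum_sq = sum(d * d for d in dev.values())

    if total_weight == 0 or sum_sq == 0:
        raise ValueError("need at least one neighbour pair and non-constant values")
    return (n / total_weight) * (cross / sum_sq)


# Hypothetical example: four villages along a road, each adjacent to the next.
road = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
gradient = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
print(morans_i(gradient, road))
```

Positive values suggest that neighbouring units tend to move together, while negative values suggest a checkerboard pattern. Running the same function on village, block, and district adjacency structures would be one way to compare spatial correlation across levels.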

Big corporate and political networks

  • Chapters from Joseph Stiglitz, The Price of Inequality
  • Chapters from Atul Kohli, Poverty Amid Plenty in the New India
  • Chapters from C. Wright Mills, The Power Elite
Firm level analysis of political and corporate interlocks
  • Mara Faccio, Politically Connected Firms
  • G. William Domhoff, Who Rules America: The Class-Domination Theory of Power
  • R. Narayanaswamy, Political Connections and Earnings Quality: Evidence from India
  • Asim Khwaja..., Do Lenders Favour Politically Connected Firms
  • Project idea: Already underway, we are assembling a large graph of companies, company ownership, board members, politicians, bureaucrats, locations, etc, and analyzing the graph.
Macro level analysis of political and corporate relationships
  • R. Kavita Rao, Revenue Foregone Estimates: Some Analytical Issues
  • Project idea: Build nice visualizations to understand national accounting - where is capital stored, how does it flow, from whom to whom... at a macro level
Control in corporate networks
  • Nick Godfrey, Why is Competition Important for Growth and Poverty Reduction
  • Dalhia Mani..., Moving Beyond Stylized Economic Network Models: The Hybrid World of the Indian Firm
  • Stefania Vitali..., The Network of Global Corporate Control
  • Bruce Kogut..., Restructuring or Disintegration of the German Corporate Network
  • Project idea: Already underway, to understand the ownership structures and director interlocks among Indian companies. But we need access to historic data to also be able to trace network changes over time
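
As a rough illustration of what a director interlock computation can look like, the sketch below finds pairs of companies that share at least one board member. It is not the code used in the ongoing project; the input format, the company and director names, and the `director_interlocks` helper are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def director_interlocks(boards):
    """Return {(company_a, company_b): set of shared directors}.

    boards: dict mapping company name -> set of director names.
    Company pairs are reported once, in alphabetical order.
    """
    # Invert the mapping so that we only compare companies that share a director.
    companies_by_director = defaultdict(set)
    for company, directors in boards.items():
        for director in directors:
            companies_by_director[director].add(company)

    interlocks = defaultdict(set)
    for director, companies in companies_by_director.items():
        for a, b in combinations(sorted(companies), 2):
            interlocks[(a, b)].add(director)
    return dict(interlocks)


# Hypothetical boards: Alpha and Beta share director "Y".
boards = {
    "Alpha": {"X", "Y"},
    "Beta": {"Y", "Z"},
    "Gamma": {"W"},
}
print(director_interlocks(boards))
```

With historic board data, the same function could be run on one snapshot per year and the resulting edge sets compared to trace how the network changes over time.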

Elections and local political networks

  • Sandip Sukhtankar, Sweetening the Deal: Political Connections and Sugar Mills in India
  • Devesh Kapur..., Quid Pro Quo: Builders, Politicians, and Election Finance in India
  • Brian Min..., Electoral Cycles and Electricity Losses in India

Agriculture, manufacturing, and employment

Agriculture patterns
  • Using agriculture census data
    • Ramesh Chand..., Farm Size and Productivity: Understanding the Strengths of Smallholders and Improving Their Livelihoods
    • Bhim Reddy..., Rise of "New Landlords": A rejoinder
  • Using web and satellite data
    • Sunandan Chakraborty..., Computing the Rate of Disappearance of Cropland Using Satellite Images
    • Sunandan Chakraborty..., Using Web Information Sources for Location Specific Summarization of Climatic and Agricultural Trends
  • Improved data collection
    • Calogero Carletto..., From Guesstimates to GPStimates: Land Area Measurement and Implication for Agricultural Analysis
  • Project idea: Find spatiotemporal patterns in the agricultural census data to identify outliers, and search for newspaper articles which can explain the observations.
  • Project idea: Match the agriculture census data against satellite data to understand the accuracy of data collection.
Mining patterns
  • Rich Lands Poor People
  • Norman Loayza..., Poverty, Inequality, and the Local Natural Resource Curse
  • Project idea: Use satellite data observations around mining areas to understand changes in local socioeconomic conditions. Match with unit level census data as well.
Industrial analysis
  • Using census and industry surveys
    • Jayan Jose Thomas, India's Labour Market During the 2000s: Surveying the Changes
    • Abhiman Das..., Profitability of the Indian Corporate Sector: Productivity, Price, or Growth?
    • Dipak Mazumdar, Employment and Inequality Outcomes in India
    • Balwant Singh..., Sectoral Linkages and Growth Prospects: Reflections on the Indian Economy
  • Using social media data
    • Alejandro Llorente..., Social Media Fingerprints of Unemployment
    • Dolan Antenucci..., Using Social Media to Measure Labour Market Flows
    • Scott R. Baker..., Measuring Economic Policy Uncertainty
  • Project idea: Use user-generated content from platforms like Gram Vaani to overlay against macro observations, and triangulate the observations in a mixed-methods approach.

Commodity prices and agriculture value chains

Data collection
  • Joshua Blumenstock..., The Price is Right? Statistical Evaluation of a Crowd-sourced Market Information System in Liberia
  • Alberto Cavallo..., The Billion Prices Project: Using Online Prices for Measurement and Research
  • Project idea: Build an app for brand X that local retailers can use to take a photograph of inventory on their shelves. Run image analysis on the photos to build a map of which of X's competitor products are also stocked. The retailers can be given a higher margin for reporting data regularly.
Explaining price fluctuations
  • Ashutosh Kumar Tripathi, Decomposing Variability in Agriculture Prices: The Case of Selected Indian Agricultural Commodities
  • Vijay Kumar Varadi, An Evidence of Speculation in Indian Commodity Markets
  • Sunandan Chakraborty..., Predicting Socio-Economic Indicators Using News Events
  • Project idea: Already underway, analysis of commodity price trends, and use of newspaper articles to classify their cause, especially to spot cases of hoarding at the local level.
Supply chains
  • Barbara Harriss-White, West Bengal's Rural Commercial Capital
  • Sukhpal Singh, New Markets for Smallholders in India: Exclusion, Policy and Mechanisms
  • Soham Sen..., ICT Applications for Smallholder Inclusion in Agribusiness Supply Chains
  • Project idea: Visit different farmer producer organizations and map their value chains. Design app/IVR/SMS-based ICT solutions to help the value chains operate in a more robust manner. We saw with the recent demonetization event how easily these supply chains can be disrupted; ICT solutions can possibly help smooth out transactions and market linkages.

Financial inclusion

  • Sa-dhan, The Bharat Microfinance Report 2016
  • Sohini Paul, Creditworthiness of a Borrower and the Selection Process in Micro-finance: A Case Study from the Urban Slums of India
  • Project idea: Census and NSSO data report people's access to microfinance and SHG networks. Compare the data reported by the microfinance organizations (aggregated by Sa-dhan) with the census and NSSO data, to check for a close match.
  • Project idea: Already underway, to use platforms like Gram Vaani to check how the microfinance funds are being utilized by the borrowers, problems faced by them, etc, and compare data reported by the people against the data reported by the microfinance organizations.
  • Project idea: Use census data about access to banking services, and compare with socioeconomic data from the regions to understand if banking penetration has increased in a strategic manner or ad hoc.

Health

Using web data
  • Steven L. Scott..., Predicting the Present with Bayesian Structural Time Series
  • David Lazer..., The Parable of Google Flu: Traps in Big Data Analysis
Using HMIS data
  • Martin Mubangizi..., Coupling Spatiotemporal Disease Modeling with Diagnosis
Using mobile phone data
  • Andrew Tatem..., Integrating Rapid Risk Mapping and Mobile Phone Call Record Data for Strategic Malaria Elimination Planning
  • Enrique Frias-Martinez..., Agent-based Modeling of Epidemic Spreading Using Social Networks and Human Mobility Patterns

Disaster preparedness

  • Raymond P. Guiteras, Satellites, Self-reports, and Submersion: Exposure to Floods in Bangladesh
  • Vanessa Frias-Martinez..., Measuring the Impact of Epidemic Alerts on Human Mobility

Media

Theory
  • Chapters from Edward Herman and Noam Chomsky, Manufacturing Consent: The Political Economy of Mass Media
  • Amelia Arsenault..., The Structure and Dynamics of Global Multi-Media Business Networks
Content analysis
  • Ceren Budak..., Fair and Balanced? Quantifying Media Bias Through Crowdsourced Content Analysis
  • Lada Adamic..., The Political Blogosphere and the 2004 US Election: Divided They Blog
  • M.D. Conover..., Political Polarization on Twitter
  • Eytan Bakshy..., Exposure to Ideologically Diverse News and Opinion on Facebook
  • Project idea: Already underway, to detect bias in Indian mass media sources, and relate it to the ownership networks of the media houses.
  • Project idea: Most mass media sources are now quite active on social media and use it as a distribution channel. Obtain Twitter data to track the flow of mass media articles on the social network. See if the spread of the articles tends to be constrained within certain social network boundaries, and if the boundaries for different sources overlap with each other. In short, try to answer if mass media bias gets neutralized on social media, or amplified?
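
One simple way to start on the question of overlapping boundaries is to compare the sets of users who share articles from each source. The sketch below uses the Jaccard index for this. It assumes the Twitter data has already been reduced to (user, source) pairs; that input format, the sample data, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def audience_overlap(shares):
    """Pairwise audience overlap between media sources.

    shares: iterable of (user_id, source) pairs, one per shared article.
    Returns {(source_a, source_b): Jaccard index of their sharing audiences}.
    """
    audiences = defaultdict(set)
    for user, source in shares:
        audiences[source].add(user)
    return {
        (s, t): jaccard(audiences[s], audiences[t])
        for s, t in combinations(sorted(audiences), 2)
    }


# Hypothetical data: only user u2 shares articles from both sources.
shares = [("u1", "SourceA"), ("u2", "SourceA"), ("u2", "SourceB"), ("u3", "SourceB")]
print(audience_overlap(shares))
```

Values close to zero for most pairs would be consistent with articles staying inside separate audiences, although a full analysis would also need the follower graph and not just the sharing events.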
Information diffusion
  • Sharad Goel..., The Structural Virality of Online Diffusion


  • Douglas Fabini..., Mapping Induced Residential Demand for Electricity in Kenya
  • Noah Klugman..., Grid Watch: Mapping Blackouts with Smart Phones
  • Fabien Chraim..., Monitoring Track Health Using Rail Vibration Sensors

Experiments and evaluation

Methods
  • Ron Kohavi..., Controlled Experiments on the Web: Survey and Practical Guide
  • Joshua Blumenstock..., Promises and Pitfalls of Mobile Money in Afghanistan: Evidence from a Randomized Control Trial
  • Joshua Blumenstock..., Risk Sharing and Mobile Phones: Evidence in the Aftermath of Natural Disasters
  • Marianne Bertrand..., What's Advertising Content Worth? Evidence from a Consumer Credit Marketing Field Experiment
  • Mike Yeomans..., Making Sense of Recommendations


Interesting links

The 4th paradigm: Data intensive scientific discovery

The Earth Engine

CEGA conference on technology for infrastructure monitoring

Data science for social good, projects at the University of Chicago

Piketty: Capital in the 21st century

Open Corporates


Project ideas


The Indian agricultural census has recorded data each year for the last 10+ years on land use, irrigation, fertilizer use, etc., right down to the village level. Can we detect trends and outliers in the data, and match them with policy changes and other events that can explain them, or else flag the occurrences that remain unexplained?
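
As a starting point for the outlier detection part, one option is a robust rule based on the median and the median absolute deviation (MAD), which is less distorted by the outliers themselves than a mean and standard deviation. The sketch below is an assumption about how this could be done, not a prescribed method; the sample series is made up, and the 0.6745 scaling and 3.5 cutoff follow the commonly used modified z-score convention.

```python
from statistics import median


def mad_outliers(series, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    values = list(series)
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, x in enumerate(values)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


# Hypothetical yearly fertilizer use for one village; the last year stands out.
fertilizer_use = [10, 11, 9, 10, 12, 10, 50]
print(mad_outliers(fertilizer_use))
```

Flagged years are only candidates: they could reflect a real event, a policy change, or a data entry error, which is where matching against news and policy records comes in.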


Can we detect illegal practices such as hoarding from an analysis of commodity price and arrival data across 1500+ mandis of India? Can we augment this with agricultural production and demand statistics to understand why sudden price fluctuations happen in the first place?
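
A very crude first filter for this idea might flag days on which the price at a mandi jumps while arrivals fall sharply. The sketch below is only a hypothetical heuristic: the thresholds, the daily data layout, and the function name are assumptions, and a flagged day could equally be a genuine supply shock rather than hoarding.

```python
def flag_price_spikes(prices, arrivals, price_jump=0.2, arrival_drop=0.2):
    """Return day indices where price rose and arrivals fell sharply.

    prices, arrivals: equal-length daily series for one commodity at one mandi.
    A day t is flagged when the price is more than `price_jump` (as a fraction)
    above day t-1 and arrivals are more than `arrival_drop` below day t-1.
    """
    if len(prices) != len(arrivals):
        raise ValueError("prices and arrivals must have the same length")

    flagged = []
    for t in range(1, len(prices)):
        if prices[t - 1] <= 0 or arrivals[t - 1] <= 0:
            continue  # cannot compute a relative change from a zero base
        price_change = prices[t] / prices[t - 1] - 1
        arrival_change = arrivals[t] / arrivals[t - 1] - 1
        if price_change > price_jump and arrival_change < -arrival_drop:
            flagged.append(t)
    return flagged


# Hypothetical series: on day 2 the price jumps while arrivals collapse.
prices = [100, 102, 130, 128]
arrivals = [500, 510, 300, 310]
print(flag_price_spikes(prices, arrivals))
```

Flagged days would then need to be checked against production statistics and local news before drawing any conclusion about their cause.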


The Annual Survey of Industries brings out detailed information on employment statistics and wages. How do these correlate with government welfare expenses, and with health, education, and other trends?


Nighttime satellite imagery, used to detect electrification, has been seen to correlate well with local socio-economic conditions. Can we use similar methods to compare the observed data with government-published data on electrification?
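
A first check along these lines could be a plain Pearson correlation between a nightlight measure and the reported electrification rate across villages. The sketch below assumes both have already been aggregated to one number per village; the sample values and the `pearson` helper are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical per-village values: mean nightlight radiance and the
# fraction of households reported as electrified.
nightlight = [0.5, 1.2, 3.4, 2.8, 0.1]
reported_electrified = [0.30, 0.45, 0.90, 0.85, 0.20]
print(pearson(nightlight, reported_electrified))
```

A weak correlation would not by itself say which source is wrong, but villages that sit far from the overall trend would be natural places to look more closely.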


Can we build nice visualizations of the Indian public accounting system to clearly see what income is realized and where it is spent, and trace individual cash flow paths?


Can we detect overlaps between politicians and companies, and help journalists investigate patterns by providing them with nuggets of relevant mass media articles and social network linkages?


Entity extraction algorithms can be adapted for Indian languages to mine regional newspapers. This news can then be listed against socio-economic and industrial trends mined about the location through surveys and other means as listed above, and also help augment the useful chunks of information about political and corporate overlap to be provided to journalists.


Can we understand how data is collected about agricultural mandi prices, agricultural census, soil conditions, etc., and improve the process through an appropriate application of ICTs to be able to collect data more frequently, automatically, and reliably?


Can we analyze Indian social media data and use it to understand the degree of polarization in people's access to news sources? This can also be related to bias detection in mainstream media, and cross-checked against the ownership of media companies.


Identify the application of IoT sensors in areas like irrigation, vaccination cold storage, vegetable cold storage, soil conditions, etc, or even the use of drones/UAVs, and build business plans aimed at high frequency data collection to improve the efficiency of operations.