April 2024
DDL's foundational mission is to apply cutting-edge tools and techniques to uncover and alleviate significant social problems. This story illustrates the power that AI - specifically large language models like GPT-4 - has to unlock new understanding of a critical issue affecting tens of millions of women in India (and beyond).
India is a confusing outlier on patriarchy: it's getting rich, but women are still secluded in the household and barely working. India has a female labor force participation rate (FLFP) of under 25%, compared to about 70% in North America and 60% in Sub-Saharan Africa. India has the highest number of working-age women outside the workforce of any country in the world.
Social norms are generally considered to play a large role in India's low female labor force participation, but that role is contested and difficult to disentangle - let alone quantify. We want to understand which social groups have changed their norms around women's status and why they have done so. Does urbanization reduce the restrictiveness of a community's gender norms? What about living in a desegregated residential neighborhood, where you are exposed to more (and different) points of view? These questions aren't revolutionary, but there has been a big problem in answering them: virtually all ethnographic data that has been systematically coded (the primary example being the Ethnographic Atlas) comes from a single slice in time.
Beginning in the 19th century, British colonial administrators collected information on Indian tribes and castes, for governance and strategic planning – and more nefariously, to divide and rule. There are thousands of castes in India, with a huge amount of heterogeneity in their cultural norms – the texts assembled by the Brits include details related to origins, historical lineage, social structures, customs, inheritance, and many other categories, and were published in a series of long-form volumes between the 1880s and 1930s. Critically, the books contain a great deal of information on marriage and gender norms.
These books became our target. We needed to take tens of thousands of pages of text, in unstructured images of varying quality, and convert them into a machine-readable dataset that we could then parse for cultural norms. Before 2022, this work would have been prohibitively costly, requiring thousands of human hours just for one pass through the texts. Language models now make it possible to systematically ask the same questions of thousands of ethnographies.
Let's consider three sample excerpts from the texts highlighting some norms of interest. Notice a few things: first, there is significant formatting heterogeneity, with inset subheadings and annotations; second, heavy use of special characters with a wide range of diacritics; and third, differences in image quality. Some of the PDF pages are far worse than the ones shown here. But language models excel at handling these kinds of imperfections.
The fundamental steps of the pipeline are to OCR the images from each page of each book into machine-readable text, then chunk those text files into community-level descriptions, and finally to submit queries to the LLM that return structured encodings of the norms we are interested in. We OCRed roughly 50,000 PDF pages through a tesseract-based workflow, with decent outcomes (we have since experimented with Claude 3 Sonnet, with very promising results).
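As a rough sketch of what that OCR step looks like in practice (the file names, DPI, and language settings here are illustrative, not our exact configuration):

```python
# Minimal sketch of a tesseract-based OCR pass over one scanned volume.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("castes_and_tribes_vol1.pdf", dpi=300)

with open("castes_and_tribes_vol1.txt", "w", encoding="utf-8") as out:
    for i, page in enumerate(pages):
        # Image quality varies a lot across volumes, so some pages need
        # cleanup in a later pass; diacritics survive reasonably well.
        text = pytesseract.image_to_string(page, lang="eng")
        out.write(f"\n--- PAGE {i + 1} ---\n{text}")
```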
The chunking step involves identifying and partitioning text for roughly 5,000 subcastes. We considered a RAG (Retrieval-Augmented Generation) approach, but it proved unreliable on this corpus, so we chunked the texts ourselves. We added special markup to the OCR output, which allowed us to chunk the text and extract community name lists that we could feed into our API calls. The markup also helped us identify hierarchical relationships between groups and subgroups and exclude irrelevant sections like images and tables. This manual chunking turned out to be reasonably quick, deterministic, and very accurate - less costly and much simpler than RAG for our specific use case.
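A simplified sketch of that markup-and-chunk logic, assuming community headings have been tagged during the markup pass with a marker like <<COMMUNITY: name>> (the marker syntax is an illustrative stand-in for our actual markup):

```python
import re

def chunk_by_community(text: str) -> dict[str, str]:
    """Split a marked-up OCR text file into community-level chunks."""
    pattern = re.compile(r"<<COMMUNITY:\s*(.+?)>>")
    matches = list(pattern.finditer(text))
    chunks = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks[m.group(1).strip()] = text[start:end].strip()
    return chunks
```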
Once the corpus is chunked by community, the next step is to query the LLM about roughly 30 classes of norms, yielding both a categorical response (usually yes / no / no information) and a field capturing all relevant text excerpts. We run these queries for every community.
After preprocessing the text, we needed to develop our prompts and validation methods to maximize performance. We had a range of knobs and tools to play with, but one of the most important was "function calling", which lets us specify the exact format in which the model delivers its response. Importantly for our use case, it allows us to separate the categorical or binary encoding of a norms variable (for example, rights of primogeniture = True) from the text excerpts containing the raw, written information that references that norm. The function call gives us another way to tailor our prompts toward the desired responses, in addition to the system message and prompt input fields.
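For a single norm and a single community-level chunk, the function definition and call look roughly like the sketch below (the schema, model name, and prompts are illustrative and far shorter than what we actually use; the real pipeline defines one such schema for each class of norms):

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative function schema: one categorical field plus a field for supporting excerpts.
encode_norm = {
    "name": "encode_norm",
    "description": "Encode a community's inheritance norm and quote all supporting text.",
    "parameters": {
        "type": "object",
        "properties": {
            "primogeniture": {
                "type": "string",
                "enum": ["yes", "no", "no information"],
                "description": "Does the eldest son hold special rights of inheritance?",
            },
            "excerpts": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Every passage in the text that references this norm.",
            },
        },
        "required": ["primogeniture", "excerpts"],
    },
}

def encode_primogeniture(community_chunk: str) -> dict:
    """Query the model for one community's text and decode the structured arguments."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are coding colonial-era ethnographic "
             "descriptions of Indian communities into structured variables."},
            {"role": "user", "content": community_chunk},
        ],
        tools=[{"type": "function", "function": encode_norm}],
        tool_choice={"type": "function", "function": {"name": "encode_norm"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```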
Below is an example response from the API - it does what we want, providing both a categorical encoding of the norm and a compilation of text excerpts in separate fields. The JSON payload makes it easy to integrate the output into our postprocessing workflow, which generates tabular data, runs a variety of validation and data quality checks, and ultimately merges with data from the 1990s to create a panel spanning over a hundred years.
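Schematically, the decoded payload and the first step of the postprocessing look like this (field names follow the illustrative schema above, and the excerpt is a placeholder rather than a quote from the corpus):

```python
import pandas as pd

# Roughly the shape of one decoded function-call response for one community.
example_response = {
    "community": "<community name>",
    "primogeniture": "yes",
    "excerpts": ["<passage from the ethnography describing eldest-son inheritance>"],
}

# Postprocessing stacks one such record per community-norm pair into a table,
# which then feeds the validation checks and the merge with 1990s data.
norms_table = pd.DataFrame([example_response])
```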
As with many applications of LLMs, validation and data quality checks are challenging. We are building a validation toolkit that will give us a high degree of confidence in the output data and signal where we may be making mistakes.
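One of the simpler checks in that toolkit is agreement with a hand-coded subsample; a sketch of what that comparison looks like (column names and labels here are illustrative):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

def agreement_report(gpt_codes: pd.DataFrame, human_codes: pd.DataFrame, norm: str):
    """Compare GPT encodings against a hand-coded subsample for one norm variable."""
    merged = gpt_codes.merge(human_codes, on="community", suffixes=("_gpt", "_human"))
    labels = ["yes", "no", "no information"]
    acc = accuracy_score(merged[f"{norm}_human"], merged[f"{norm}_gpt"])
    cm = confusion_matrix(merged[f"{norm}_human"], merged[f"{norm}_gpt"], labels=labels)
    return acc, cm
```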
Our preliminary results are promising. In the confusion matrix below, our overall accuracy is about 95% when compared with human classifiers. Crucially, GPT does well not only on the encoding, but also at extracting all relevant text excerpts. And this error rate is conservative - some of the encodings we marked incorrect were judgment calls that a human may have struggled with as well. Indeed, one of the challenges of this project is developing norms classifications that can accurately represent the incredible diversity and context-specificity of the norms in the texts.
Our preliminary tests give us confidence that these ethnographies, read at scale with modern LLMs, have enough resolution and temporal coverage to describe gender-relevant cultural norms in great detail - and, for the first time, to map how they changed from the beginning to the end of the 20th century for thousands of different social groups.
Some of the most socially important questions in developing countries are constrained by the lack of reliable data - historical data, yes, but also recent data that are embedded in narrative descriptions. We have already begun processing the written decision text from 80 million court cases. We plan to use the content of these decisions to better understand legal barriers to urban development, judicial bias, and political influence in the judiciary. We're also experimenting with parsing text from municipal bylaws, to systematically understand restrictions on land development in cities around the world.