Recovering Historical Data from Text: leveraging LLMs for social impact

Development Data Lab Team

April 2024

The problem - why aren't women working in India?

DDL's foundational mission is to apply cutting-edge tools and techniques to uncover and alleviate significant social problems. This story illustrates the power of AI - specifically modern LLMs like ChatGPT - to unlock new understanding about a critical issue affecting tens of millions of women in India (and beyond).

India is a confusing outlier on patriarchy: it's getting rich, but women are still secluded in the household and barely working. India has a female labor force participation rate (FLFP) of under 25%, compared to about 70% in North America and 60% in Sub-Saharan Africa. India has more working-age women outside the workforce than any other country in the world.

[Figures: female labor force participation charts from The Economist, NPR's Goats and Soda blog, the ADB statistical portrait, and Bloomberg]

Social norms are generally considered to play a large role in the disappointing lack of female empowerment, but that role is contested and difficult to disentangle - let alone quantify. In order to make progress on this question, we need to understand which social groups have changed their norms around women's status and why they have done so. Does urbanization reduce the restrictiveness of a community's gender norms? What about living in a desegregated residential neighborhood, where you're exposed to more (and different) points of view? These questions aren't revolutionary, but there has been a big problem in answering them: virtually all ethnographic tabular data is static and based on recent surveys, making it impossible to understand these essential social dynamics.

If only we could learn what things were like in the pre-modern era, and examine why change isn't happening!

Data availability, or lack thereof

There is some good machine-readable data going back as far as the 1990s, but we need to go back much further in order to answer our patriarchy riddle. British colonial administrators collected information on Indian tribes and castes, for governance and strategic planning – and more nefariously, to divide and rule. There are thousands of castes in India, with a huge amount of heterogeneity in their cultural norms – the texts assembled by the Brits include details related to origins, historical lineage, social structures, customs, inheritance, and many other categories, published in a series of long-form volumes between the 1880s and the 1930s. Critically, the books contain a great deal of information focusing on marriage and gender norms.

Historical norms data book covers

These books became our target. We needed to take tens of thousands of pages of text, in unstructured images of varying quality, and convert them into a machine-readable dataset that mapped exactly onto the flat tables we already possessed describing cultural norms in the 1990s. (We would then bring in additional, more recent data from there.) How to proceed?

We estimated the expense and timeline of extracting all norms across all books manually, including the use of data entry services, and found that it would cost many tens of thousands of dollars – plus a great deal of additional engineering and manual work to validate the output – and would take at least six months, if not twice that. So while these data are highly valuable from an analytical point of view, accessing them for the public good has historically been prohibitively expensive. We needed a new approach.

Before we describe our application of LLMs to solve this problem, take a look at three sample excerpts from the texts highlighting various norms of interest. Notice a couple of things: first, there is significant formatting heterogeneity, with inset subheadings and annotations; second, there is heavy use of special characters with a wide range of diacritics; and third, there are differences in image quality. Some of the PDF pages are far worse than the ones shown here. These issues caused annoyances at both the OCR and querying steps, but were resolvable.

Historical norms data excerpt

Meat consumption behavior is described, with additional complexity around subgroup behavior.

Historical norms data excerpt

A relatively straightforward description of the widow remarriage norm.

Historical norms data excerpt

A clear remarriage description. Note subsequent text describing marriage rites and divorce customs.

The processing pipeline

The fundamental steps of the pipeline are to OCR the images from each page of each book into machine-readable text, then chunk those text files into community-level descriptions, and finally to submit queries to the LLM that return structured encodings of the norms we are interested in. We OCRed roughly 50k PDF pages through a tesseract-based workflow, with decent outcomes (we have since experimented with Claude 3 Sonnet, with very promising results).
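For concreteness, here is a minimal sketch of what the OCR step might look like, assuming each book's pages have already been exported as individual images; the paths and Tesseract settings are illustrative rather than our exact configuration.

```python
# Hedged sketch of the OCR step: convert each page image of a book to text and
# concatenate into a single file. Paths and settings here are illustrative only.
from pathlib import Path

import pytesseract
from PIL import Image

def ocr_book_pages(page_dir: str, out_path: str) -> None:
    """OCR every page image in page_dir and write the concatenated text to out_path."""
    pages = sorted(Path(page_dir).glob("*.png"))
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            text = pytesseract.image_to_string(Image.open(page))
            out.write(text + "\n\f\n")  # form feed marks the page boundary

# Example usage (hypothetical directory layout):
ocr_book_pages("books/castes_and_tribes_vol1/pages", "ocr/castes_and_tribes_vol1.txt")
```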

The chunking step involves identifying and partitioning the text for roughly 5,000 subcastes. We ended up chunking manually rather than taking a RAG (Retrieval-Augmented Generation) approach. Using a minor mode we created for Emacs, we added special markup which allowed us to chunk and extract community name lists that we then fed into our API calls. Our markup also helped us identify hierarchical relationships between groups and subgroups and exclude irrelevant sections like images and tables. We decided to take this approach instead of using RAG because manual chunking was reasonably quick, deterministic, and very accurate - it ended up being less costly and much simpler for our specific use case.
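To make the chunking concrete, here's a small sketch of how marked-up text can be split into community-level chunks. The `@@group:` marker syntax is purely illustrative - it stands in for the Emacs markup we actually used.

```python
# Hedged sketch of the chunking step. The "@@group:" marker is a stand-in for the
# real markup; everything between one marker and the next becomes that community's chunk.
import re

GROUP_MARKER = re.compile(r"^@@group:\s*(?P<name>.+)$", re.MULTILINE)

def chunk_by_community(corpus: str) -> dict[str, str]:
    """Split marked-up OCR text into one text chunk per community name."""
    chunks = {}
    matches = list(GROUP_MARKER.finditer(corpus))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(corpus)
        chunks[match.group("name").strip()] = corpus[start:end].strip()
    return chunks
```

Because the markers are inserted by hand, the partition is deterministic and easy to audit - exactly the properties that made this approach preferable to RAG for our use case.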

With community-wise chunks of the corpus in hand, the next step of the pipeline is to query our LLM about roughly 30 classes of norms, yielding both a categorical response (usually yes / no / no information) and a field capturing all relevant text excerpts. We run these queries for all communities.

Deploying the LLMs

After preprocessing our text, we needed to develop our prompts and our validation methods to maximize performance. We had a range of knobs and tools to play around with, but one of the primary ones was the use of "function calling", which allows us to specify the exact format in which ChatGPT delivers its response. Importantly for our use case, it lets us separate the categorical or binary encoding of a norms variable (for example, rights of primogeniture = True) from the text excerpts that contain the raw, written information referencing that norm. The function call gives us another way to tailor our prompts towards our desired responses, in addition to the system message and prompt input fields.
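As an illustration, a single function-calling request for one norm on one community chunk might look roughly like the sketch below. The model name, schema, and field names are assumptions for the example, not our production prompt.

```python
# Illustrative function-calling request: encode the widow remarriage norm for one
# community chunk. Schema and field names are assumptions, not the production prompt.
from openai import OpenAI

client = OpenAI()

# Hypothetical path to one community's chunk produced by the chunking step.
community_chunk = open("chunks/example_community.txt", encoding="utf-8").read()

tools = [{
    "type": "function",
    "function": {
        "name": "encode_norm",
        "description": "Encode whether the community permits widow remarriage.",
        "parameters": {
            "type": "object",
            "properties": {
                "widow_remarriage": {
                    "type": "string",
                    "enum": ["yes", "no", "no_information"],
                },
                "excerpts": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Verbatim passages that mention widow remarriage.",
                },
            },
            "required": ["widow_remarriage", "excerpts"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You encode social norms from ethnographic text."},
        {"role": "user", "content": community_chunk},
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "encode_norm"}},
)

arguments = response.choices[0].message.tool_calls[0].function.arguments  # JSON string
```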

Below is an example response from the API - it does what we want, providing both a categorical encoding of the norm and a compilation of text excerpts in separate fields. The JSON payload makes it incredibly easy to integrate the output into our postprocessing workflow, which generates tabular data, runs a variety of validation and data quality checks, and ultimately merges with data from the 1990s to create a panel spanning over a hundred years.

Norms output example

Validation

As with many applications of LLMs, validation and data quality checks are probably the most challenging area of the entire project. Developing the tools and concepts necessary to confidently evaluate the performance of our pipeline took some time and a lot of care. We've ended up in a place where we have a high degree of confidence in our output panel dataset, and we've developed ways to iterate quickly, both to facilitate prompt engineering and to identify areas where our domain knowledge was being challenged.

Our results have been quite promising! In the confusion matrix below, you can see that our overall accuracy rate was roughly 95% - I expect this number to go up by the time we finish the project, but this is an acceptable amount of noise for our purposes. Crucially, GPT did well not only on the encoding, but also on extracting all relevant text excerpts. We consider this error rate to be conservative – some of the encodings we marked incorrect were a judgement call that a human may have struggled with as well. These queries were run on vanilla GPT-4 and checked against manually encoded norms across three books. Since this confusion matrix was assembled, we've been working on scaling out our manually-defined test set, as well as adding complementary automated checks described below.

Norms output confusion matrix
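Mechanically, this check reduces to a simple comparison between the manual codes and the model's codes. The sketch below shows one way to compute it; the file names and column layout are assumptions, not our actual schema.

```python
# Hedged sketch of the accuracy check behind the confusion matrix. The CSV layout
# (community, norm, value) is illustrative only.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

manual = pd.read_csv("validation/manual_codes.csv")  # columns: community, norm, value
model = pd.read_csv("output/gpt_codes.csv")          # same columns, GPT-encoded

merged = manual.merge(model, on=["community", "norm"], suffixes=("_manual", "_gpt"))
print("accuracy:", accuracy_score(merged["value_manual"], merged["value_gpt"]))
print(confusion_matrix(merged["value_manual"], merged["value_gpt"],
                       labels=["yes", "no", "no_information"]))
```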

We're currently finalizing our validation methodologies and plan to run the full set of queries in May of 2024. But we are confident that modern LLMs are finally giving us enough resolution and temporal coverage to clearly elucidate the role that cultural norms play in female empowerment, for the first time - and we hope to find the causal pathways and obstacles for India, which has remained stubbornly behind despite incredible economic growth.

So our next steps here really center on improving how we validate the model output, which falls into three buckets. First, we're scaling out our confusion matrix by enlarging our library of manually constructed tests for specific communities in specific books. This will substantially increase the number of observations behind our confusion matrix and extend it across all of our target books.

Next, we're implementing a Response Evaluation model to focus on the internal consistency between the excerpts and the encoded norms variables returned by the function call. Basically the idea is to deploy a separate LLM to evaluate the output of the primary LLM. We won't be quite as confident in these results as our manual set of tests, but the Response Evaluation model easily scales across the entire dataset, which is essential. In our case, the model is essentially just replicating human evaluation paradigms and scaling them.
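A minimal sketch of that idea follows; the prompt wording and model choice are purely illustrative.

```python
# Hedged sketch of the Response Evaluation model: a second LLM judges whether the
# encoded value is actually supported by the extracted excerpts.
from openai import OpenAI

client = OpenAI()

def evaluate_response(norm: str, encoding: str, excerpts: list[str]) -> bool:
    """Return True if the judge model finds the encoding consistent with the excerpts."""
    prompt = (
        f"Norm: {norm}\n"
        f"Encoded value: {encoding}\n"
        "Excerpts:\n"
        + "\n".join(f"- {e}" for e in excerpts)
        + "\n\nDoes the encoded value follow from the excerpts? Answer only 'yes' or 'no'."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```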

Lastly, we're experimenting with developing a confidence score that could help us flag ambiguity in interpreting the raw text, which we would then review manually. The aim of this project is to completely open-source the panel dataset when we're done with it. There are many, many important analyses that should be done - far more than we could do ourselves, or could even think of. One of the most rewarding parts of my work is seeing what the broader community does with the data we put out in the world - even if it means that other people find our bugs. J/K, I love a good bug report.

Norms GPT validation overview

Looking forward - other applications of this approach

So where do we go from here, given what we've learned? We think these techniques will generalize well to additional datasets that will help us answer important but historically intractable questions, like: "How much do urbanization and internal migration to cities determine upward mobility? How do zoning and other regulations affect growth and human outcomes in cities? How biased is the judicial system? And to what extent are major polluters causing morbidity and mortality as well as worsening climate change?"

Some of the most socially important questions in developing countries are limited by a lack of reliable data - historical data, yes, but also recent data that are too difficult to access through traditional methods. We've already begun processing the written decision text from 80 million court cases, which, like our norms data, exist in PDF form – with the help of LLMs, the content of these decisions will allow us to finally understand urban development, judicial bias, and political influence in the judiciary well enough to really push for evidence-based policy design. And we're just spinning up a big effort to partner with large private companies to mobilize their proprietary data to answer some of these questions that simply can't be solved with public data sources alone.

[Icons: urban construction, scales of justice, a politician, a farm, a doctor]

Thank you for joining us on this journey! We're incredibly excited to be advancing this frontier, and believe there is a great deal of social good latent in these incredible new tools. We will continue to push at the margin. If you're interested in joining or supporting our mission, please reach out.
