Q & A with Sun Kyoung Lee, Recipient of the $1M National Science Foundation (NSF) Economics Grant | Department of Economics at Columbia University

This project uses a big data approach to find out what lies behind the tremendous growth in the American economy in the nineteenth and twentieth centuries. Can you illustrate for readers what you’re going to be doing?

Lee: As the title “A Big Data Approach to Understanding American Growth” implies, this project is a major endeavor to understand crucial aspects of the United States’ process of economic growth. It investigates the tremendous growth of the American economy in the nineteenth and twentieth centuries and analyzes the country’s transition from a rural economy to an industrial nation. One major part of my work on the project is to construct a longitudinal database of everyone who lived in the United States and to follow each person over time. I have linked various historical data sources, including U.S. demographic census records across years, and have linked other sources, such as immigration databases, to the U.S. census.

The transformation of the U.S. economy during this time period was remarkable, from a rural economy at the beginning of the nineteenth century to an industrial nation by the end. More strikingly, after lagging behind the technological frontier for most of the nineteenth century, the United States entered the twentieth century as the global technological leader and the richest nation in the world. What will the results from this project tell us about how people lived and how businesses operated?

Lee: We are still working on answering these questions. However, to give you a bit of a sneak peek, the funded project digitizes the historical plant-level U.S. Census of Manufactures. With the constructed historical manufacturing census data, this project explores the factors that contributed to manufacturing growth at the firm level and how this, in turn, affected America’s aggregate economic growth. In another analysis, I investigate how a child’s chance of moving up relative to the child’s parents has evolved, and how economic shocks and policies affected such trends in the nineteenth and twentieth centuries. These were periods of large changes, including emancipation and electrification, and I hope that many interesting lessons can be drawn from the project.

What do you anticipate being the most challenging aspect of combining these various datasets?

Lee: Most datasets that I deal with are in archival format, which means that they are locked in the original format (typically paper, or sometimes microfilm). Even when some datasets are in machine-readable format, the data is not harmonized at all, which means there is a long process of data construction and harmonization before one can conduct research through data analysis. And even when data exists in a harmonized, machine-readable format, following the same individuals is challenging in the absence of time-invariant, individual-specific identifiers. Using frontier techniques of artificial intelligence, I link various datasets in different formats and link the same individuals over time.
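To make the linking problem concrete, here is a minimal sketch of how two census snapshots might be matched when no shared identifier exists. It is not the method used in the project, which relies on artificial intelligence techniques; it is a simple rule-based illustration. The field names (`name`, `age`, `birthplace`), the thresholds, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(earlier, later, years_apart, min_similarity=0.85, age_tolerance=2):
    """Link each earlier record to at most one later record.

    A pair is a candidate when birthplace matches exactly, the age is
    consistent with the time elapsed (within age_tolerance years), and
    the names are similar enough. Only unambiguous candidates are linked.
    """
    links = {}
    for i, old in enumerate(earlier):
        candidates = []
        for j, new in enumerate(later):
            # Birthplace should not change over time, so use it to block.
            if old["birthplace"] != new["birthplace"]:
                continue
            # Reported ages are noisy, so allow a small tolerance.
            if abs((old["age"] + years_apart) - new["age"]) > age_tolerance:
                continue
            if name_similarity(old["name"], new["name"]) >= min_similarity:
                candidates.append(j)
        # Skip people with zero or several plausible matches.
        if len(candidates) == 1:
            links[i] = candidates[0]
    return links


# Hypothetical records; spelling and age drift between the two censuses.
census_1900 = [
    {"name": "John Schmidt", "age": 12, "birthplace": "Germany"},
    {"name": "Mary O'Brien", "age": 30, "birthplace": "Ireland"},
]
census_1910 = [
    {"name": "John Schmitt", "age": 23, "birthplace": "Germany"},
    {"name": "Mary OBrien", "age": 40, "birthplace": "Ireland"},
    {"name": "John Smith", "age": 22, "birthplace": "England"},
]

links = link_records(census_1900, census_1910, years_apart=10)
print(links)
```

Dropping ambiguous matches rather than guessing trades coverage for accuracy, which is one reason a linked sample typically covers only part of the original records.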

I have worked for years on data integration, linking, and documentation to merge information across data types, and I would say that this long process of data construction, including data linking and harmonization, is one of the most challenging aspects of combining these various datasets.

Naturally, after working for years on this type of data, I have often felt that “there were reasons that such efforts were not made to this date!” This type of work requires many different types of inputs and resources: manual data transcription, well-trained research assistants, a very high-quality document scanner, and powerful optical character recognition software, to name a few, not to mention researchers’ brain power and the many cups of coffee needed to put everything together.

I am truly grateful for this grant from the National Science Foundation. It will not only expedite the current workflow of the ongoing projects, but also enable us to pursue future projects that might be almost impossible otherwise. This resource-intensive project could be carried out thanks to investment from the government and from other organizations such as Columbia University’s Program for Economic Research.[1]

I see that you’ve linked immigration/census records for immigrants from four countries. How many countries will be represented? Approximately how many individuals are represented in the immigration/passenger data?

Lee: Currently, more than half of the immigrant records are linked to the historical U.S. demographic census records. So far, this linked data represents immigrants from four major immigrant-sending countries in Europe (the United Kingdom, Ireland, Germany, and Italy) during the Great Migration era. However, we are adding immigration data from countries beyond those listed. Additionally, we are collaborating with other institutes to paint a richer picture of immigrants. More is on the way, so please stay tuned!

How did you meet Costas Arkolakis and Michael Peters?

Lee: Years ago, Costas Arkolakis learned at an economics conference that I was working on restricted-access, complete-count U.S. Federal Demographic Census records and suggested that we collaborate on the issues of America’s urbanization. To borrow one of my advisors’ remarks about our collaboration, “There is a beautiful complementarity between his work and your work,” and years of collaboration began.

Michael Peters is a colleague of Costas, and at the time Michael heard of my record linking of U.S. census records, he was working on a project that measures the long-run impact of the expulsion of the German population after World War II. When Michael heard of my ongoing collaboration with Costas, he became intrigued and wondered whether the project could be extended to an immigration channel. (Given the political climate on immigration policies, issues of immigration were heavily discussed.) Thankfully, I had access to the required databases and the technology that would enable an immigration study. Out of excitement, after several email exchanges, the three of us met in New York City and began developing ideas to investigate immigrants’ impact on the United States’ economic growth. There was very good synergy among the three of us, and we have been developing ideas ever since.

Is there anything, in particular, I should be sure to emphasize? To your mind, what is the most important aspect of this project?

Lee: When I visited my parents in Korea last summer, my mom asked me one afternoon, “Sun, what do you work on for your studies? What keeps you busy?” When I explained my job market paper and the funded project, she responded, “Wow. Do such data exist? Plus, it must be a ton of work.” I think this captures the essence of what I want to do and why. I wish to develop user-friendly, next-generation data resources that can help advance research in the social sciences. I want to enhance access to new data sources, vast amounts of data that were simply unattainable in the past. And I want to give the constructed resources back to the public so that researchers and interested members of the public can uncover the underlying forces of America’s remarkable transition and growth.

Sun Kyoung Lee is a sixth-year Ph.D. student in the Department of Economics. Photo by Jeffrey Schifman.

[1] Columbia University’s Program for Economic Research has provided me with seed funds to match individuals in census data for my study of American urbanization during 1850-1950.