Utilizing Named Entity Recognition to Identify Unique Actors

by Jerome M. Hendricks

My dissertation work explores the actions of intermediary firms in periods of rapid technological change. As our economy has become increasingly geared toward knowledge sectors (Powell and Snellman 2004), intermediary firms take on an increasingly important role in establishing markets and developing consumer relationships with products and services. Using the independent record store as a case of an intermediary operating in a rapidly changing market, I argue that certain actors can collectively alter the symbolic meaning of goods and services to enable their survival. To make this argument, I have collected over 2,400 music industry and media documents from 1992 to 2012 in order to track changes in strategy and field understanding over time. In a recent paper, I was able to establish changes in organizational understandings empirically by comparing the discourse of independent record stores before and after drastic technological innovation. Since then, I have been investigating ways to test these findings by tracking types of music retail firms, along with their strategies and understandings, over the twenty-year period. In the discussion that follows, I will share my experience looking to computer science technologies for new ways to identify and track actors and ideas in large data sets.

Early coding of my data utilized an ethnographic content analysis (ECA) approach through Atlas.ti, a computer-assisted qualitative data analysis software (CAQDAS) package. ECA combines the quantitative emphasis of structured data collection with descriptive information about the context in which meanings emerge (Altheide and Schneider 2013). I found this approach very useful, and I could conceivably use it to answer other questions I have. However, from a practical standpoint, large data sets like mine take a long time to code and analyze this way. On the advice of one of my committee members, Dr. John Mohr, I looked to new ways that social scientists have incorporated data mining software to extract meanings from large data sets. I was immediately intrigued by the growing body of literature that utilizes topic modeling as a procedure for coding text into meaningful categories of word clusters associated within and across documents. For examples and analysis of this approach, refer to the special issue on topic models and cultural sciences in Poetics found here. While this approach to extracting “topics” would appear to offer a compelling way to uncover the strategies and understandings I am interested in, it does not isolate the types of actors associated with each topic. Without that step, my unit of analysis shifts from the types of organizations to the data sources themselves.

In the same issue of Poetics mentioned above, Mohr and Bogdanov (2013) point to other compatible data mining strategies that can be combined with topic modeling to obtain an even closer view of meanings in texts. Specifically, Mohr and colleagues (2013) analyze the discursive style of the state utilizing a series of natural language processing (NLP), semantic parsing, and topic modeling procedures. This approach allows the authors to identify significant actors, determine their actions in texts, and consider the context in which these actions take place. A subfield of NLP called named entity recognition (NER) offers the most promise for identifying different people, places, organizations, and other miscellaneous human artifacts in texts. Because these tools require expertise in computer programming, I contacted Dr. Dan Roth at the University of Illinois at Urbana-Champaign to inquire further about NER and its applications. From his demo page, you can try a variety of different NLP procedures on sample texts, including the NER tagger that will be the central focus of the remainder of this discussion.

Before discussing the specific approach that Dr. Roth, his assistant Chase Duncan, and I have taken, it is important to consider a few challenges with the NER tagger due to the unique nature of my data set. First, while my data set is large, it isn’t particularly massive. Currently, these tools are better suited for open exploration of hundreds of thousands of documents. So while my practical concerns over time management and the size of my data set are real, my data set is rather small relative to the concerns of computer scientists. With a data set in the thousands, accuracy becomes central as there is much less margin for error. In other words, we’ll simply have fewer opportunities to capture target organizations. This leads to a second concern: independent record stores are a somewhat unique entity and can be easily overlooked or misclassified by the NER tagger. In the Mohr et al. (2013) paper discussed above, the entities of interest are relatively well-known (multi-national organizations, nations, geo-political entities, and so on) and can be easily verified. Rather than finding references to the “United States” or “Afghanistan”, we are looking for “Dave’s Records” or “Bucket O’Blood Books & Records.” While both projects require a certain amount of training to improve the tagger's performance, I am unaware of a complete historical record of independent record stores that can be utilized for training purposes. More information on training an NER tagger for social science purposes can be found here.
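To make the difficulty with store names concrete, here is a minimal sketch of a name-list (gazetteer) lookup, one common way to supplement a tagger with known names. It is an illustration only, not a description of Dr. Roth's tagger: the two-entry store list, the function names, and the normalization rules are all hypothetical, and a real list would have to be compiled from outside sources.

```python
import re

# Hypothetical name list built from the two store names mentioned above.
KNOWN_STORES = ["Dave's Records", "Bucket O'Blood Books & Records"]


def normalize(text):
    """Lowercase, unify curly apostrophes, spell out '&', collapse whitespace."""
    text = text.lower().replace("\u2019", "'").replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()


def find_stores(document, known_stores=KNOWN_STORES):
    """Return the known store names whose normalized form occurs in the document."""
    doc = normalize(document)
    return [name for name in known_stores if normalize(name) in doc]


# A made-up sentence with curly apostrophes and "and" in place of "&".
sample = ("Shoppers lined up outside Dave\u2019s Records and "
          "Bucket O\u2019Blood Books and Records.")
print(find_stores(sample))
```

Even this toy version has to handle apostrophe variants and "&" versus "and" before a name will match, and plain substring matching would still miss abbreviations or misspellings. It also only finds stores that are already on the list, which is exactly the limitation for small shops that no list records.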

Despite these challenges, our team is confident that we can train the NER tagger to perform at a high level on the somewhat unique entities we aim to identify. The first step in training the tagger requires a separate data set that includes a variety of unique and standard music store names (from “Permanent Records” to “Musicland” to “Best Buy”) and mirrors the “messiness” of media data, such as links to other news stories, advertisements, and so on. To date, I have compiled 100 articles not used in the original data set for testing the NER tagger. To assist the tagger in identifying stores, we will incorporate Dr. Roth’s Wikifier tool, which utilizes Wikipedia as an authoritative source for resolving identities. While many small stores will not be listed on Wikipedia, this will help us increase the accuracy of identifying large chain retailers and popular independent record stores throughout our data. As Dr. Roth and his colleagues have noted previously (Godby et al. 2010), other authority sources have the potential to increase the effectiveness of resolving identity issues. To this end, utilizing various online databases of independent record stores (e.g. recordstoreday.com, vinylhunt.com, or goingthruvinyl.com) may also be useful in training the NER tagger. Once the modified version of the NER tagger is complete, we will be able to test our trained tagger on this separate data set and compare our results with human classification to assess accuracy and prepare our tool for the original “large” data set.
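The comparison with human classification could be scored in several ways. The sketch below uses precision, recall, and F1, which are standard measures for NER output, though I am assuming them here rather than reporting the measures our team has settled on. The function name and the tiny sets of (article number, store name) pairs are hypothetical and are not results from my data.

```python
def precision_recall_f1(predicted, gold):
    """Score tagger output against human coding, treating each as a set of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of tagger mentions that the human coder also marked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of human-coded mentions that the tagger found.
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (article number, store name) pairs, for illustration only.
human_coded = {(1, "Permanent Records"), (1, "Best Buy"), (2, "Musicland")}
tagger_output = {(1, "Best Buy"), (2, "Musicland"), (2, "Best Buy")}

p, r, f = precision_recall_f1(tagger_output, human_coded)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting precision and recall separately matters for a data set of this size: a missed independent store (low recall) and a chain wrongly tagged in an article (low precision) affect the analysis differently, and a single accuracy figure would hide which one is happening.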

While it is entirely likely that my research will utilize some of the data mining software tools already familiar to social science research, our attempts to adapt the NER tagger to unique actors have significant implications for content analysis in social science research. In terms of data set size, our ability to train the NER tool more specifically will provide smaller projects with a level of accuracy otherwise attainable only through manual coding procedures. In conjunction with other data mining procedures, such accuracy can allow for hypothesis testing as well as exploratory work. If these tagging procedures and training processes are standardized, their transferability among similar situations may suggest some generalizability of results. And, in light of the cooperative efforts that have brought us this far, software packages that are more accessible to social scientists, not unlike many topic modeling packages, also seem possible. Though these implications may be little more than conjecture on my part at this point, the prospects for developing procedures that contribute to new approaches to content analysis are exciting. I look forward to reporting the testing results as they become available and assessing the possibilities of NER tagging when actor identities are relatively unique.

References

Altheide, David L., and Christopher J. Schneider. 2012. Qualitative Media Analysis. 2nd ed. Los Angeles: SAGE Publications.

Godby, Carol Jean, Patricia Hswe, Larry Jackson, Judith Klavans, Lev Ratinov, and Dan Roth. 2010. “Who’s Who in Your Digital Collection: Developing a Tool for Name Disambiguation and Identity Resolution.” Journal of the Chicago Colloquium on Digital Humanities and Computer Science 1 (2).

Mohr, John W., and Petko Bogdanov. 2013. “Introduction—Topic Models: What They Are and Why They Matter.” Poetics 41 (6). Topic Models and the Cultural Sciences: 545–69.

Mohr, John W., Robin Wagner-Pacifici, Ronald L. Breiger, and Petko Bogdanov. 2013. “Graphing the Grammar of Motives in National Security Strategies: Cultural Interpretation, Automated Text Analysis and the Drama of Global Politics.” Poetics 41 (6). Topic Models and the Cultural Sciences: 670–700.

Powell, Walter W., and Kaisa Snellman. 2004. “The Knowledge Economy.” Annual Review of Sociology 30: 199–220.
