Harvard Business Review data analysis

Competition entry: https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect/prospector?prospectId=119

Natural language processing tools were used to identify trends over 90 years. The resulting visualizations show some of the most important phrases of each decade, the organizations and persons that were most talked about, and the percentage distribution of the articles in a particular year across various categories.

Method

I used two techniques: key phrase extraction and Named Entity Recognition. Abstracts of the articles published in each decade were grouped together, and the TextRank algorithm was used to find the essential key phrases and words in the grouped texts. I used a readily available Java-based implementation of TextRank in conjunction with WordNet. The key phrases were ranked by their importance in the context of the decade (link rank) and by their number of occurrences (count rank). I separated commonly used words with a high count rank, such as “Company” and “Organization”, from these key phrases; these words were later used to categorize the articles.
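To make the two ranks concrete, here is a minimal Python sketch of a TextRank-style keyword ranking. It is not the Java implementation used for this entry and it leaves out WordNet entirely; the stopword list, the window size, the damping factor and the function names are all assumptions chosen for illustration. The "link rank" is approximated by a PageRank score over a word co-occurrence graph, and the "count rank" by plain word frequency.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that", "are", "as", "by"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def textrank(token_lists, window=3, damping=0.85, iterations=50):
    """Score words with PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for words in token_lists:
        for i, w in enumerate(words):
            # Link each word to the words that follow it inside the window.
            for other in words[i + 1:i + window]:
                if other != w:
                    neighbors[w].add(other)
                    neighbors[other].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return scores

def rank_decade(abstracts, top_n=200):
    """Return (word, link_score, count) tuples for one decade, best link score first."""
    token_lists = [tokenize(a) for a in abstracts]
    link = textrank(token_lists)
    count = Counter(w for words in token_lists for w in words)
    ranked = sorted(link, key=lambda w: link[w], reverse=True)[:top_n]
    return [(w, link[w], count[w]) for w in ranked]
```

Calling rank_decade on the list of abstracts for one decade gives both scores per word, so words with a very high count can be set aside as category terms while the rest feed the word cloud.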

Named Entity Recognition (NER) was carried out using Stanford NER. The recognizer was run over the abstracts to pull out Person and Organization entities. The top 30 of each were selected to understand which individuals and organizations were most discussed in any particular decade. Some filtering was carried out to remove single-named individuals like “John” and “Sam” and organizations like “harvard”, as these were redundant and carried no significant meaning. Finally, I also carried out NER on all author affiliations to find out where the majority of the published authors come from.
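The filtering and top-30 selection can be sketched in a few lines of Python. This covers only the post-processing step: it assumes the recognizer's output has already been collected as (entity text, label) pairs with labels such as PERSON and ORGANIZATION, and the blocklist, the function name and the input shape are hypothetical rather than taken from the original code.

```python
from collections import Counter

# Organizations treated as uninformative (illustrative; the post only names "harvard").
ORG_BLOCKLIST = {"harvard"}

def top_entities(tagged, label, top_n=30):
    """Count entities of one label, after dropping single-named persons and blocklisted organizations."""
    counts = Counter()
    for text, tag in tagged:
        if tag != label:
            continue
        name = " ".join(text.split())  # normalize whitespace
        if label == "PERSON" and len(name.split()) < 2:
            continue  # skip single names like "John"
        if label == "ORGANIZATION" and name.lower() in ORG_BLOCKLIST:
            continue
        counts[name] += 1
    return counts.most_common(top_n)
```

Running this once per decade and per label yields the lists behind the organization and individual charts.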

Most of the visualizations were developed using the D3 JavaScript visualization library and the Google Chart and Maps APIs.

Results

I used around 200 key phrases for each decade and built a word cloud visualization for each. The reason for using 200 key phrases was to capture the gist of the entire decade. From these visualizations, one can easily make out that the 2010s are all about social networks or air-cargo models, and that the 2000s had some tough new criminal penalties in place.

Categorization

I further categorized the articles using the terms that had the highest count rank during key phrase extraction. Terms like ‘article’, ‘letter’ and ‘issue’ were, of course, filtered out. Each category comprises all the relevant concepts associated with it. For example, ‘Employ’ comprises ‘Employee Health Insurance’, ‘Employee Rights’, etc. Some interesting observations are the decreasing importance of concepts based on ‘Customer’ and ‘Strategy’ as compared to ‘Competition’ or ‘Product’, and the increasing discussion of concepts related to ‘Academia’.
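One way to compute the per-year percentage distribution is sketched below in Python. The matching rule (a category stem appearing as a substring of the abstract), the stem list and the input shape of (year, abstract) pairs are assumptions made for illustration; the original categorization may have worked differently.

```python
from collections import Counter, defaultdict

# Illustrative category stems; 'employ' matches 'employee', 'employment', etc.
CATEGORY_STEMS = ["employ", "customer", "strategy", "competition", "product"]

def category_shares(articles):
    """Return {year: {stem: percentage}} from (year, abstract) pairs."""
    per_year = defaultdict(Counter)
    for year, abstract in articles:
        text = abstract.lower()
        for stem in CATEGORY_STEMS:
            if stem in text:
                per_year[year][stem] += 1  # an article may count toward several categories
    shares = {}
    for year, counts in per_year.items():
        total = sum(counts.values())
        shares[year] = {stem: 100.0 * n / total for stem, n in counts.items()}
    return shares
```

Because an article can fall into several categories, the percentages here are shares of category matches in a year, not shares of articles.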

Most Discussed Organizations

Named Entity Recognition extracted many key terms and divided them into Organization, Person and Location. I selected the top 30 most discussed Organizations and Persons, after the filtering discussed above. The first thing that stands out in the visualization is the huge bubble for ABN-AMRO Bank in the 1990s. This corresponds to the historic merger of two of the largest banks during that time and the series of acquisitions (the term ‘Acquisition’ can also be seen to be discussed a lot during the 1990s-2000s in the previous graph). The ‘Securities & Exchange Commission’ seems to have been discussed a lot during the 70s, a period in which the term ‘Regulation’ can also be seen to have a larger count rank.

Most Discussed Individuals

We can clearly see the rise of the ideologies of Michael Porter and Peter Drucker during the 90s.

Where are the authors located??

Finally, I developed a visualization to understand where exactly the authors of the articles in Harvard Business Review come from. Well, the answer is: from all corners of the world. (Maybe not Africa!!)

The visualizations can be downloaded in a higher resolution from https://maulik-kamdar.com/code/harvard/ or viewed in a user-friendly interface at https://maulik-kamdar.com/2012/08/harvard-business-review-vision-statement/#results (use the left and right arrow keys to navigate).
