Big Data Industrial Problem Solving Workshop
at the Fields Institute, 222 College Street, Toronto

Organizing Committee: Paul McNicholas, Huaxiong Huang, Tyler Wilson

Overview:

The interaction between industry and academia has many potential benefits for both. Academics learn about interesting potential research problems and find application for their existing tools. Industries get access to some of the most experienced mathematical modellers and problem-solvers on the continent. At the end of the week, the academic experts make a presentation consisting of the problem restatement and their solution. This is a summary of results; the teams also prepare reports for the industrial sponsors.

History and mission statement

Fields Institute for Mathematical Sciences:
Founded in 1992, the Fields Institute plays a central role in "promoting contact and collaboration between professional mathematicians and the increasing number of users of mathematics". It supports research in pure and applied mathematics and statistics. Thematic programs of international interest, academic workshops, and prizes are organized by the Institute.

Of specific interest to the business community is the Commercial and Industrial Mathematics program. This program seeks to develop synergistic links between mathematicians and industrial partners. The Fields Industrial Problem-Solving Workshop (FIPSW) is a new initiative in this direction.

What the workshop is about:

Objectives:
The objective of the IPSW is to connect industries with faculty, postdocs and graduate students who have expertise in industrial case-studies. This interaction is fostered in the specific context of a problem-solving session over 5 days. The case-studies in question have a significant mathematical or statistical content.

Format:
The IPSW will take place over 5 days. Participants will include a group of academic experts (including mathematicians and statisticians) as well as experts from industry. On the first day, the industrial sponsors will present their problem statements. The academic experts will divide into small teams, with one team assigned to each problem. The teams will spend the next 3 days collaborating on solutions to their problems and will present their solutions on the final day of the workshop.

Deliverables:
At the end of the week, the academic experts make a presentation consisting of the problem restatement and their solution. This is a summary of results; the teams also prepare reports for the industrial sponsors.

2015 Problems

Problem 1: Presenters - DBRS

Corporate Credit Estimates Problem Description

The big-picture project that we are working on is to take a large set of financial data from all tax-filing companies in Europe over a limited time period and, from this, develop an entirely quantitative credit rating methodology. Analyst-developed credit ratings are very time consuming, and often all that is needed is an estimate. We are working to create a method that is either instant or at least takes a minimal amount of an analyst's time.

Credit ratings follow a scale of A (safe), B (investment grade), C (junk) and D (imminent default). Each of A to C is divided into triple, double and single (AAA, AA and A), from highest to lowest.

Finally, these are sometimes further divided into high, stable and low (AA-high, AA-stable and AA-low), for a total of up to 28 possible ratings. Every rating category has an expected default rate, but companies with higher ratings have very few defaults that occur while still in that category; usually companies are downgraded before defaulting. We are looking for a model that can correctly rank companies against each other. Then, if analyst ratings are known for some of the companies, we can apply categorical ratings to the others.
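The count of 28 can be checked by enumerating the scale as described above. The sketch below is illustrative only: the hyphenated labels follow the "AA-high" pattern in the text, and it assumes that D carries no subdivisions, which is how the description reads.

```python
from itertools import product

LETTERS = ["A", "B", "C"]              # D is handled separately below
MULTIPLICITIES = [3, 2, 1]             # triple, double, single
MODIFIERS = ["high", "stable", "low"]


def rating_scale():
    """List every rating from best to worst: 'AAA-high', ..., 'C-low', 'D'."""
    scale = [
        f"{letter * count}-{modifier}"
        for letter, count, modifier in product(LETTERS, MULTIPLICITIES, MODIFIERS)
    ]
    scale.append("D")  # assumed to have no triple/double or high/low variants
    return scale


scale = rating_scale()
print(len(scale))  # 3 letters x 3 multiplicities x 3 modifiers + D = 28
```

Because the list is ordered from best to worst, the position of a rating in `scale` can serve as an ordinal target when comparing a model's ranking with known analyst ratings.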

Our dataset consists of company information and related accounting information for all European companies. The company information is a company identifier, industry, size, country, and a series detailing the company's legal status over time. The accounting information is all income statement (earnings, expenses ...) and balance sheet (assets, liabilities ...) data, as well as some share price and market data for public companies. Most of this data is only available yearly, and there are at most 10 years of data for each company. We refer to the accounting data for one firm in one year as one observation.

Most of the types of statistical models that we have looked at require assumptions about the data that we must test. First, we need independent observations, but our observations are grouped by company and by year, and then by industry and country. To add to the complexity, our observations are high dimensional. We should also know the distribution of our data, but we would need a way to test the distribution of multidimensional data.

We derive our default data from the companies' legal statuses over time. Some statuses make it very clear whether the company has defaulted, but others are fairly ambiguous. We hope to find a way to include companies with these ambiguous statuses in our dataset as either healthy or defaulted. Can this be done without creating bias in our data?

Finally, in accounting it is common knowledge that more information can be drawn from ratios of accounting variables than from the accounting variables themselves. These ratios are simple functions of two or three accounting variables. In creating our model we will need to look for ratios that are better predictors of default than others. There is some intuition involved, but we need a way to create, or at least choose, these ratios statistically.
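One possible starting point for choosing ratios statistically is to generate every two-variable ratio and score each by how well it separates defaulted from healthy observations. The sketch below uses the area under the ROC curve (AUC) as that score. This is an assumption made for illustration, not the presenters' method, and the variable names and figures are invented toy values.

```python
from itertools import permutations


def auc(scores, defaulted):
    """Chance that a random defaulted observation scores above a random
    healthy one, counting ties as one half."""
    pos = [s for s, d in zip(scores, defaulted) if d]
    neg = [s for s, d in zip(scores, defaulted) if not d]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_ratios(observations, defaulted):
    """Score every ratio numerator/denominator, best separation first."""
    variables = list(observations[0])
    results = []
    for num, den in permutations(variables, 2):
        # Skip ratios that are undefined for at least one observation.
        if any(obs[den] == 0 for obs in observations):
            continue
        values = [obs[num] / obs[den] for obs in observations]
        a = auc(values, defaulted)
        # An AUC of 0.5 is uninformative; distance from 0.5 measures
        # separation regardless of direction.
        results.append((abs(a - 0.5), f"{num}/{den}", a))
    return sorted(results, reverse=True)


# Toy firm-year observations (invented values, for illustration only).
observations = [
    {"debt": 80, "assets": 100, "earnings": 2},
    {"debt": 30, "assets": 100, "earnings": 12},
    {"debt": 90, "assets": 120, "earnings": 1},
    {"debt": 20, "assets": 90, "earnings": 10},
]
defaulted = [True, False, True, False]

for separation, name, a in rank_ratios(observations, defaulted):
    print(f"{name:18s} AUC={a:.2f} separation={separation:.2f}")
```

A univariate screen like this ignores the grouping of observations by company, year, industry and country noted earlier, so it would at most be a first filter before fitting a model.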

Problem 2: Presenters - The TMX Group

Within the capital markets ecosystem, volatility is defined as a measure of variations in the price of a stock over time. While volatility correlates with the frequency with which a stock is traded, little research exists on how volatility comes to exist in the first place.
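The definition above does not fix a formula. One common convention, assumed here purely for illustration, is the annualized sample standard deviation of log returns. The function name and the 252 trading periods per year are choices made for this sketch, not details from the problem statement.

```python
import math
import statistics


def realized_volatility(prices, periods_per_year=252):
    """Annualized sample standard deviation of log returns."""
    if len(prices) < 3:
        raise ValueError("need at least three prices (two returns)")
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return statistics.stdev(log_returns) * math.sqrt(periods_per_year)


# Invented closing prices for illustration.
steady = [100.0, 110.0, 121.0]          # identical return each period
choppy = [100.0, 104.0, 99.0, 105.0]    # returns change sign
print(realized_volatility(steady), realized_volatility(choppy))
```

A price series that grows by the same percentage every period has essentially zero volatility under this measure, which shows that it captures variation in returns rather than the overall trend.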

Using multiple data sources including Canadian stock market data, the Fields Institute Problem Solving Workshop in Big Data will partner with the TMX Group to identify causal factors creating price volatility differences between highly traded stocks, and those which trade less frequently.

Problem 3: Presenters - GlaxoSmithKline

Pharmacovigilance in small sample sizes, rare adverse drug events, and low drug exposure prevalence

Although early detection and assessment of drug safety signals are important, post-approval drug safety studies often face challenges such as small size, rare incidence of adverse outcomes, and low exposure prevalence after the launch of a new drug or vaccine. In addition, nonrandomized studies of treatment effects in healthcare data are vulnerable to confounding bias. Propensity Score (PS) methods are increasingly used to control for measured potential confounders, especially in pharmacoepidemiologic studies of rare outcomes in the presence of many covariates from different data dimensions of large administrative healthcare and electronic health records databases.

The High-Dimensional Propensity Score (hd-PS) algorithm is semi-automated software that can select and adjust for different baseline characteristics of patients in drug and vaccine safety studies. This software is used by investigators, including FDA Sentinel and the European Medicines Agency (EMA), to monitor drug and vaccine safety. The hd-PS algorithm prioritizes variables within each data dimension (e.g., inpatient diagnoses, inpatient procedures, outpatient diagnoses, outpatient procedures, dispensed prescription drugs) by their potential for confounding control, based on their prevalence and on bivariate associations with the treatment and with the study outcome. Once variables have been prioritized, a predefined number of variables with the highest potential for confounding per dimension is chosen for inclusion in the PS.
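The prioritization step can be sketched as follows. The text above does not give the scoring formula, so this example assumes the Bross bias multiplier, a formula often cited for combining a covariate's prevalence among exposed and unexposed patients with its association with the outcome. The covariate names and numbers are hypothetical, and the sketch covers a single data dimension.

```python
import math


def bross_bias(p_c1, p_c0, rr_cd):
    """Multiplicative bias from an unadjusted binary covariate.

    p_c1:  covariate prevalence among exposed patients
    p_c0:  covariate prevalence among unexposed patients
    rr_cd: relative risk linking the covariate to the outcome
    """
    return (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)


def prioritize(covariates, k):
    """Names of the k covariates with the largest |log(bias)|."""
    ranked = sorted(
        covariates,
        key=lambda c: abs(math.log(bross_bias(c["p_c1"], c["p_c0"], c["rr_cd"]))),
        reverse=True,
    )
    return [c["name"] for c in ranked[:k]]


# Hypothetical covariates within one data dimension.
covariates = [
    {"name": "code_a", "p_c1": 0.30, "p_c0": 0.10, "rr_cd": 2.0},
    {"name": "code_b", "p_c1": 0.20, "p_c0": 0.20, "rr_cd": 3.0},
    {"name": "code_c", "p_c1": 0.05, "p_c0": 0.25, "rr_cd": 1.5},
]
print(prioritize(covariates, k=2))
```

In this toy input, `code_b` has the strongest link to the outcome but is equally common in both treatment groups, so its bias multiplier is 1 and it ranks last. The other two covariates are imbalanced between the groups and rank ahead of it.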

Early detection and evaluation of drug safety signals is important; however, the hd-PS may face challenges in situations such as small sample sizes, rare adverse drug events, and low drug exposure prevalence. Our proposed solution, aggregating medical codes using hierarchical coding systems, improved the performance of the hd-PS in controlling for confounders, reducing bias by up to 19% in an empirical example. We will share the study findings and discuss further research to demonstrate the benefits of this aggregation method.
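A toy example can show why aggregation may help when samples are small. Truncating granular codes to a higher level of the hierarchy merges several rare codes into one more prevalent category. The sketch below assumes ICD-10-style codes truncated to their three-character category. The actual aggregation rules used in the study are not described here, and the patient data are invented.

```python
from collections import Counter


def aggregate(code, level=3):
    """Truncate a hierarchical code to its first `level` characters,
    ignoring the dot separator."""
    return code.replace(".", "")[:level]


def prevalence(patient_codes, level=None):
    """Fraction of patients carrying each code, optionally aggregated."""
    counts = Counter()
    for codes in patient_codes:
        # Use a set so each patient counts at most once per code.
        seen = {aggregate(c, level) if level else c for c in codes}
        counts.update(seen)
    n = len(patient_codes)
    return {code: count / n for code, count in counts.items()}


# Invented patients, one set of diagnosis codes each.
patients = [{"E11.9"}, {"E11.65"}, {"E11.21", "I10"}, {"I10"}]

print(prevalence(patients))           # granular codes
print(prevalence(patients, level=3))  # three-character categories
```

In this toy input, each granular E11 code appears in only one of four patients, while the aggregated E11 category covers three of them.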

References

The central role of the propensity score in observational studies for causal effects
Paul R. Rosenbaum (University of Wisconsin, Madison) & Donald B. Rubin (University of Chicago)

High-dimensional propensity score adjustment in studies of treatment effects using health care claims data
Sebastian Schneeweiss, Jeremy A. Rassen, Robert J. Glynn, Jerry Avorn, Helen Mogun, and M. Alan Brookhart (Department of Medicine Brigham and Women’s Hospital/Harvard Medical School)

Effects of aggregation of drug and diagnostic codes on the performance of the high-dimensional propensity score algorithm: an empirical example
Hoa V. Le, Charles Poole, M. Alan Brookhart, Victor J Schoenbach, Kathleen J. Beach, J Bradley Layton, and Til Stürmer

Problem 4

Women's Rugby Sevens

Sports analytics is a rapidly growing field that is changing teams' approaches to training, preparation, and competition tactics. Recognizing the benefits of incorporating sports analytics to gain a competitive advantage over other teams, the Rugby Canada Women's Sevens program has already been collecting a host of physiological, anthropometric, medical, positional, movement, technical and tactical data on its players and teams. Examining the relationships between the various data streams will address a number of gaps both on and off the field. Tactical analysis will allow coaches to better target game strategies, training approaches, and player selection. Identification of key tactical indicators will allow coaches to adjust game strategy at optimal times. Targeted training programs based on key game tactical indicators will improve tactical indices, leading to better game understanding and decision-making. Understanding the relationship between the datasets and tactical outcomes will create tactical performance profiles that further our understanding of top player characteristics, fostering talent and helping with player selection procedures. We currently need expert support in effectively integrating results to assist the team in:

1) Identifying performance indicators that best predict match outcomes (general, pool vs. cup, top vs. bottom teams, …)

2) Measuring the contribution of each player to team success (possibly using parameters identified in 1))

3) Understanding strengths and weaknesses of our team and opponents

4) Tracking player development and game development

5) Maximizing the training effectiveness of our daily training environment

Schedule

Monday May 25

9:00 - 10:30 Problem presentations and discussions

10:30 Coffee break

11:00 - 12:30 Problem presentations and discussions (continued)

12:30 Lunch onsite

1:30 - 3:00 Group discussions

3:00 Coffee break

3:30 - 5:00 Group discussions

5:00 Summary session

Tuesday May 26

9:00 - 10:30 Problem presentations and discussions

10:30 Coffee break

11:00 - 12:30 Problem presentations and discussions (continued)

12:30 Lunch onsite

1:30 - 3:00 Group discussions

3:00 Coffee break

3:30 - 5:00 Group discussions

5:00 Summary session

Wednesday May 27

9:00 - 10:30 Problem presentations and discussions

10:30 Coffee break

11:00 - 12:30 Problem presentations and discussions (continued)

12:30 Lunch onsite

1:30 - 3:00 Group discussions

3:00 Coffee break

3:30 - 5:00 Group discussions

5:00 Summary session

Thursday May 28

9:00 - 10:30 Problem presentations and discussions

10:30 Coffee break

11:00 - 12:30 Problem presentations and discussions (continued)

12:30 Lunch onsite

1:30 - 3:00 Group discussions

3:00 Coffee break

3:30 - 5:00 Group discussions

5:00 Summary session

Friday May 29

9:00 - 10:30 Final presentation

10:30 Coffee break

11:00 - 12:30 Final presentation

12:30 Lunch onsite

Previous Industrial Problem Solving Workshops:

August 11-14, 2014
Fields-MPrime Industrial Problem-Solving Workshop

August 20-24, 2012
Industrial Problem-Solving Workshop on Medical Imaging

June 22-26, 2009
OCCAM-Fields-MITACS, Math-in-Medicine Study Group

August 11-15, 2008
Fields-MITACS Industrial Problem-Solving Workshop

August 14-18, 2006
Fields-MITACS Industrial Problem-Solving Workshop
