The Role of Data in Slowing Coronavirus

Forensic Data Analytics Challenges: Part 4 of Many - Leveraging Data Despite Known Limitations

Introduction

According to the World Health Organization, we are facing a verified pandemic. Coronavirus has hit almost every country in the world. In the United States, all but one state has reported its first confirmed case. Guidance from the top in this country has been mixed, with some suggesting that everything is going to be all right and that no dramatic actions are required. On the other hand, a number of medical officials have said that social distancing is a must to “flatten the curve,” a term used to describe a reduction in the rate of infections. Adding to the confusion, stated death rates from the disease differ depending on whom you ask. When the pieces of information we receive contradict one another, what should we believe? When we look for truth, a good place to start is always with the data.

Death rates

Let’s start by taking a closer look at the computation of death rates. We have heard consistently that the death rate of COVID-19 is greater than the death rate of the flu. The death rate of the flu is generally regarded as 0.1%, meaning that 1 out of every 1,000 reported cases is likely to lead to death. For coronavirus, we have seen wildly different ratios in the past few months. Currently Hubei, China (the province whose capital is Wuhan) has settled into a 5% mortality rate, while Iran and Italy have eclipsed China at rates of 7% and 8%, respectively. The current rate in the United States is 1.9%. Taken at face value, these percentages might suggest that the coronavirus is almost 20 times worse than the flu.

A deeper dive leads us to question the rate calculated from today’s data. The death rate associated with any disease is generally a fairly simple calculation: the number of deaths caused by the disease divided by the number of confirmed cases of the disease. In today’s coronavirus calculation, the numerator is well known and well documented. Unfortunately, the tracking of cases has not been nearly as accurate. We have been told that many who have coronavirus never develop symptoms. Even when symptoms have surfaced, testing has not necessarily taken place. This means that the denominator in the death rate calculation is understated, so the death rate could very well be significantly overstated (smaller denominators lead to larger values).
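To make the effect of an understated denominator concrete, here is a minimal Python sketch. The function names, the counts (19 deaths among 1,000 confirmed cases, chosen only to reproduce the 1.9% U.S. figure quoted earlier), and the `detection_fraction` parameter are illustrative assumptions, not values taken from the underlying data.

```python
def death_rate(deaths: int, confirmed_cases: int) -> float:
    """Deaths divided by confirmed cases, as described in the text."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return deaths / confirmed_cases


def adjusted_death_rate(deaths: int, confirmed_cases: int,
                        detection_fraction: float) -> float:
    """Rate after scaling the denominator up to an estimated true case count.

    detection_fraction is an ASSUMED share of all infections that were
    actually confirmed by a test (0 < detection_fraction <= 1).
    """
    if not 0 < detection_fraction <= 1:
        raise ValueError("detection_fraction must be in (0, 1]")
    estimated_true_cases = confirmed_cases / detection_fraction
    return deaths / estimated_true_cases


# Illustrative counts only: 19 deaths among 1,000 confirmed cases.
reported = death_rate(19, 1000)
# If only half of all infections had been confirmed, the rate halves.
adjusted = adjusted_death_rate(19, 1000, detection_fraction=0.5)

print(f"reported: {reported:.2%}, adjusted: {adjusted:.2%}")
```

The point of the sketch is only directional: the lower the share of infections that are confirmed, the more the reported rate overstates the true one.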

Should I give up on this data?

The frustration of imperfect data often leads researchers to seek other mechanisms for insights. To some extent, going back to the drawing board may be the right approach when seeking to reach certain conclusions. For example, it probably doesn’t make sense to compare the severity of the flu to that of COVID-19 using data in which the number of confirmed cases is known to be understated. If you are really pressed to perform that comparison, it might make sense to find some sub-group for which testing has been more comprehensive. An example would be the Princess cruise population, with seven deaths from a pool of 696 confirmed cases, or roughly 1%. (A review similar to this is probably how Dr. Anthony Fauci, Director of the National Institute of Allergy and Infectious Diseases, arrived at his conclusion of a 1% mortality rate for COVID-19, which he stated was likely 10 times worse than the flu.)
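The arithmetic behind the cruise ship comparison can be checked in a few lines of Python. The deaths, confirmed cases, and 0.1% flu rate are the figures quoted in this article; the function names are my own and purely illustrative.

```python
FLU_DEATH_RATE = 0.001  # 0.1%, the flu figure cited earlier in the article


def death_rate(deaths: int, confirmed_cases: int) -> float:
    """Deaths divided by confirmed cases."""
    return deaths / confirmed_cases


def times_worse_than_flu(rate: float) -> float:
    """How many times larger a death rate is than the cited flu rate."""
    return rate / FLU_DEATH_RATE


cruise_rate = death_rate(7, 696)  # seven deaths, 696 confirmed cases

print(f"{cruise_rate:.2%}")                         # 1.01%
print(f"{times_worse_than_flu(cruise_rate):.1f}x")  # 10.1x
```

This is consistent with the roughly 1% rate and the "10 times worse than the flu" comparison mentioned above.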

But you can use the data to get a directional sense of relationships, especially when comparing data that is captured under similar circumstances. The chart below presents a mark for each U.S. state at the intersection of its population (along the y-axis) and the number of cases it has witnessed per million people (along the x-axis). Only the states in which a death caused by COVID-19 has occurred are labeled (data as of 3/16/2020 from CSSE at JHU), and the number on each label is the number of deaths in that state. Looking at the chart, a number of things stand out and may give us insight going forward. It should be noted that some states have populations over 20 million and some states have more than 32 cases per million, but each is capped at these values for the purposes of presentation on the chart.
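As a rough sketch of how each mark’s position could be computed, the Python below derives cases per million and applies the two presentation caps described above. The function names and the sample inputs in the comments are hypothetical; only the caps (20 million people, 32 cases per million) come from the text, and the actual chart may well have been produced differently.

```python
POPULATION_CAP = 20_000_000  # y-axis cap stated in the text
CASE_RATE_CAP = 32.0         # x-axis cap (cases per million) stated in the text


def cases_per_million(cases: int, population: int) -> float:
    """Confirmed cases scaled to a population of one million."""
    return cases * 1_000_000 / population


def chart_position(cases: int, population: int) -> tuple[float, int]:
    """Return (x, y) for a state's mark, clamped to the chart's caps."""
    x = min(cases_per_million(cases, population), CASE_RATE_CAP)
    y = min(population, POPULATION_CAP)
    return x, y


# Hypothetical inputs, not real state data:
# a very large state with few cases is capped on the y-axis only,
# while a smaller state with a high case rate is capped on the x-axis only.
print(chart_position(cases=100, population=25_000_000))
print(chart_position(cases=500, population=5_000_000))
```

Capping in this way keeps outliers on the chart, at the cost of hiding how far beyond the cap they actually sit.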

COVID-19 Data Represents Snapshot on 3/16/2020 Underlying COVID-19 data sourced from CSSE at JHU

One thing that jumps out at me is that if you look only at the states that appear in the upper right-hand portion of the grid (population over 4 million and more than 8 cases per million), you will see marks for 9 states. Of those 9 states, 8 have had deaths. Additionally, the marks that are furthest from the origin are generally the states that have had the most deaths.
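The same "upper right-hand portion" check can be expressed as a simple filter. This is only a sketch: the `StateSnapshot` class, the function names, and the four placeholder records are invented for illustration and are not the JHU data. Only the two thresholds (population over 4 million, more than 8 cases per million) come from the text.

```python
from dataclasses import dataclass


@dataclass
class StateSnapshot:
    name: str
    population: int
    cases: int
    deaths: int

    @property
    def cases_per_million(self) -> float:
        return self.cases * 1_000_000 / self.population


def upper_right(states, min_population=4_000_000, min_rate=8.0):
    """States above both thresholds used in the text."""
    return [s for s in states
            if s.population > min_population and s.cases_per_million > min_rate]


def share_with_deaths(states) -> float:
    """Fraction of the given states that recorded at least one death."""
    if not states:
        return 0.0
    return sum(1 for s in states if s.deaths > 0) / len(states)


# Placeholder records for illustration only (NOT real state data).
snapshot = [
    StateSnapshot("State A", 6_000_000, 60, 2),   # 10 per million, large
    StateSnapshot("State B", 5_000_000, 50, 0),   # 10 per million, large
    StateSnapshot("State C", 3_000_000, 90, 1),   # high rate, small population
    StateSnapshot("State D", 10_000_000, 30, 0),  # large, low rate
]

selected = upper_right(snapshot)
print([s.name for s in selected], share_with_deaths(selected))
```

Run against the real 3/16/2020 snapshot, a filter like this would be expected to return the 9 states described above, 8 of which had recorded deaths.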

What have we learned?

Interestingly enough, the mark for Massachusetts is just a little bit above that of Colorado, and the mark for Illinois is above that of Georgia. This means that they have larger populations with about the same case rate. But neither Massachusetts nor Illinois has suffered a death. This might be enough for healthcare providers to look into what those states may be doing well that could be leveraged by providers in other states to improve survival for coronavirus sufferers.

Similarly, Texas has a very large population (over 20M), but also has had no deaths due to COVID-19. Interestingly, Texas has maintained a rate of infection that is below 3 per million people. This low rate could be because confirmed cases are outside of urban city centers so the spread of the virus is slower than in places like New York City. Once again, this is probably something worth looking into, to see if the Texas policy-makers may be doing something that is working to contain the spread.

Finally, a couple of states had very few confirmed cases (9 to be specific), and yet both recorded a death among that limited caseload. This certainly could be just an anomalous circumstance, but it may be worth digging into the details to see if there are processes that could be changed in those states that would improve survivability for future patients if the number of cases increases.

I think it is also worth mentioning that the U.S. death rate has been hovering around 2%. However, spikes can be seen in the state of Washington (over 5%), where, as many of us know, an elderly care facility in King County was particularly hard hit. We also see a spike in the state of Florida (over 3%), which one may assume has to do with the larger percentage of older people in the population. These situations point to the need to be exceptionally cautious with our elderly. In the case of Washington, it could also point to the possibility that these rates can be impacted if spikes occur in very specific areas. This could be a harbinger of scenarios that could occur if our healthcare system is overwhelmed.

New and Changing Patterns

It’s certainly possible that the pattern described in this article is just a random coincidence. We would need much more data and much further investigation to conclude that there was a clear indication of causality. Regardless, what we never want to do as data analysts is think that we have it all figured out because we saw one pattern in a few locations at a specific point in time associated with one particular event. Patterns can and will change. In my anti-fraud efforts, we witness these adjustments all the time. A fraudster who is thwarted on one attempt is unlikely to attempt the next fraud in the same manner. Similarly, what we see with COVID-19 will change, and I’ve included some reasons below. It is up to analysts to spot the trends when they arise.

1. Changes in culture. If a particular city has not developed a culture of caution, it certainly is susceptible to an outbreak. If such an outbreak led to a surge in one specific location, it could overwhelm local hospitals and death rates could be affected. On the other hand, regions that develop a culture of caution, including social distancing, can reduce caseloads and likely improve outcomes. Additionally, the culture has changed with the understanding that individuals can carry the virus and be contagious without ever feeling a symptom. The virus also has an incubation period of around five days, so even if you were to feel symptoms, it likely wouldn’t be until almost a week after being infected. This knowledge has made people more careful whether or not they are feeling sick.

2. Changes in regulation. We have heard of lockdowns in Wuhan and Italy. Less sweeping restrictions have begun in a number of cities here at home. These include closing restaurants or keeping them open only for pick-up and delivery, closing schools, and increased work-from-home allowances by employers, just to name a few. Each of these actions should slow down the rate of spread from what it could have been.

3. Changes in health care practices. As with other pandemics we have witnessed, medical advances will be made to combat COVID-19. In the short term, this has been seen in better tools to test for the virus and a better understanding of the mechanism of spread. In the hopefully not-too-distant future, a vaccine will obviously help reduce the rate of spread.

4. Changes in data collected. From the beginning of this outbreak in the U.S., testing has been limited to patients who meet certain requirements. We have heard that a number of states are preparing to roll out new tests which have been developed in the past few weeks. These tests will allow a larger number of people to be tested and results to be obtained sooner. Together, these changes will hopefully aid in the containment of the virus. They will also undoubtedly have an immediate effect of driving down death rates, simply because the denominator of that ratio (the number of confirmed cases) will increase, and perhaps also because early detection may improve care. Anyone analyzing the subsequent data will need to be aware of the lack of consistency in the data across the states (and across time) to avoid arriving at improper conclusions.

5. Further findings by data analysts. One of the early findings with this disease was the increased susceptibility of older patients compared to younger ones. The data that I have secured does not contain age information for patients, but I did hear a statistic indicating that all deaths in the U.S. were among patients 40 and above. This knowledge has raised the care with which we treat the elderly and has potentially increased some of the risks taken by the more youthful. Actions like this won’t change the science of the virus, but they could change the patterns associated with patient age.

Conclusion

The business of data analytics is all about identifying patterns in data and converting them into meaningful actions. The impact of COVID-19 has been slowed by a number of instances in which data analysts have witnessed patterns related to the age of patients, periods of contagiousness and length of incubation periods. There is no doubt that analytics will play an even greater role as we get over this global hump by helping identify leading practices in health care, effectiveness of government regulations and the viability of vaccines.