Solved – Statistical analysis of relational database: is it possible and how

I have been struggling with flat file databases and corresponding statistical packages for almost 20 years now (from Excel to SPSS, then Stata, and currently R).

However, I have always had to convert complex, multidimensional relational databases (e.g. in Access or MySQL) into often oversimplified flat-file databases, which is time consuming at best and often reduces the amount of information available for each analysis.

Indeed, the approach I have always followed is the typical one of converting a relational database, through specific queries, into one or more flat-file databases. While this is simple enough for most analyses, especially univariate and bivariate ones, it becomes more confusing for multivariable and multivariate analyses, as it requires multiple complex queries and, most importantly, often oversimplifies the data themselves.

Now that I am trying to get better acquainted with big data and data science, I wonder whether the shift to big data will also require a shift to data analysis that encompasses multiple tables and relations, without diluting the efficiency and power of a relational database by converting it into multiple flat-file databases.

So, my question is simply: is it possible to directly perform complex (e.g. multivariable) analyses of relational databases? And if yes, how?

This is not (only) a philosophical question. For instance, I am now working on a relatively large (reaching 2000 patients) observational study on transcatheter aortic valve implantation for severe aortic stenosis (RISPEVA). It is based on a MySQL electronic case report form comprising 12 separate tables with complex relations and often multiple entries per patient. My approach so far to identifying predictors of long-term death (e.g. when looking for a score) has been, as usual, to create multiple tables through queries and then distill the key features capable of predicting death. This means going through multiple stages of analysis, which is time consuming at best.
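To make the flattening step concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module rather than MySQL. The `patient` and `followup` tables, their columns, and the rows in them are hypothetical toy data, not the RISPEVA schema. A one-to-many table is collapsed to one row per patient with a join and aggregates, which is exactly where information can be lost:

```python
import sqlite3

# In-memory toy database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    age        INTEGER,
    died       INTEGER              -- 1 = died during follow-up, 0 = alive
);
CREATE TABLE followup (
    visit_id   INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    creatinine REAL
);
INSERT INTO patient VALUES (1, 81, 0), (2, 77, 1), (3, 85, 0);
INSERT INTO followup VALUES
    (1, 1, 0.9), (2, 1, 1.1),
    (3, 2, 1.8), (4, 2, 2.4), (5, 2, 2.0);
""")

# Flatten: one row per patient, with the repeated visits reduced to summaries.
# LEFT JOIN keeps patients who have no follow-up visits at all.
rows = conn.execute("""
SELECT p.patient_id, p.age, p.died,
       COUNT(f.visit_id)  AS n_visits,
       MAX(f.creatinine)  AS max_creatinine
FROM patient AS p
LEFT JOIN followup AS f ON f.patient_id = p.patient_id
GROUP BY p.patient_id
ORDER BY p.patient_id
""").fetchall()

for row in rows:
    print(row)
```

The result is a flat table ready for a regression package, but the order and spacing of the visits are gone, and patient 3 (no visits) gets a missing value. Each such aggregate is a modelling decision made before any analysis starts.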

My fear, however, is that this approach might overlook one or more of the relational features of the data, and thus lose precision or accuracy. Could it be done in a different fashion, directly analyzing the relational database as it stands?

My understanding of your question is that you are interested in methods to uncover multidimensional relationships in data yet are reluctant to take low-dimensional slices of the data for analysis. This is, in a sense, the basis of many machine learning algorithms that use data in high dimensions to make predictions or classifications with often very complex rules that are learned directly from the data.

There are classes of relational methods which perhaps fit more neatly into what you are thinking of, however. For example, the infinite relational model is a Bayesian nonparametric framework for identifying hidden structure across many dimensions in a way that appears to conceptually match what you want. For a sample problem that this might be used for, consider a relational database which contains 3 tables with 3 different primary keys and containing information on a set of cases $S$, a set of patients $P$ and a set of doctors $D$ that performed procedures during these cases. I offer this as a low-dimensional example but all of this can be scaled up to include more data.

Then, suppose that you have an indicator variable denoting whether or not the patient had a good outcome. As shown in the paper I linked, you could simultaneously find partitionings of each of $S$, $D$ and $P$ such that each partition cell contained similar outcomes. This learning is done by performing optimization of the likelihood of the data under a Bayesian model. This might inform you as to which doctors are good or bad, or whether certain patients are particularly troublesome for a given procedure. Again, this framework is flexible and affords a range of generative models for the underlying process.
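As a rough illustration of what such a model scores, here is a minimal Python sketch. It is a simplification under stated assumptions, not the method from the paper: it uses a single binary doctor-by-patient relation (1 = good outcome) instead of three entity sets, a Chinese restaurant process prior on each partition, and a Beta-Bernoulli likelihood per block. All function names, the hyperparameters `gamma` and `beta`, and the toy matrix `R` are hypothetical. It only evaluates candidate partitions; a real analysis would search over partitions, for example with MCMC.

```python
from collections import Counter
from math import lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def crp_log_prior(assignments, gamma=1.0):
    """Log probability of a partition under a Chinese restaurant process."""
    counts = Counter()
    logp = 0.0
    for i, z in enumerate(assignments):
        if counts[z] == 0:                    # item i opens a new cluster
            logp += log(gamma / (i + gamma))
        else:                                 # item i joins an existing cluster
            logp += log(counts[z] / (i + gamma))
        counts[z] += 1
    return logp


def block_log_likelihood(R, row_z, col_z, beta=1.0):
    """Beta-Bernoulli marginal likelihood of R, one coin per (row, col) block."""
    ones, zeros = Counter(), Counter()
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            key = (row_z[i], col_z[j])
            if r:
                ones[key] += 1
            else:
                zeros[key] += 1
    total = 0.0
    for key in set(ones) | set(zeros):
        total += log_beta(ones[key] + beta, zeros[key] + beta) - log_beta(beta, beta)
    return total


def irm_score(R, row_z, col_z, gamma=1.0, beta=1.0):
    """Unnormalised log posterior of a pair of partitions given the relation."""
    return (crp_log_prior(row_z, gamma) + crp_log_prior(col_z, gamma)
            + block_log_likelihood(R, row_z, col_z, beta))


# Toy relation: rows are doctors, columns are patients, 1 = good outcome.
R = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

aligned = irm_score(R, [0, 0, 1, 1], [0, 0, 1, 1])    # blocks are homogeneous
scrambled = irm_score(R, [0, 1, 0, 1], [0, 1, 0, 1])  # blocks mix outcomes
print(aligned > scrambled)
```

Partitions whose blocks contain similar outcomes receive a higher score, which is the sense in which each partition cell "contains similar outcomes" above.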

This may be more complex than what you desired – it's a bit of a jump from Excel or SPSS to writing custom inference code in another programming language. Still, it's how I would approach this problem.
