I want to prove that, overall, signal B is correlated to signal A. I was thinking of using cross-correlation (in R) to measure this.

Essentially I have two kinds of signals: signal A is a series of single-valued data describing a particular song; signal B is a series of single-valued data for a user. There are many songs and many users per song, but I do not have the same number of users for every song.

For example:

    Signal A (song data), for song 1
    0.994 0.986 0.955 0.890 0.795 0.650 ...

    Signal A (song data), for song 2
    0.763 0.788 0.787 0.908 0.854 0.901 ...

    Signal B (user data), for user 1 listening to song 1
    75 74.4 73.7 73 72.3 72 ...

    Signal B (user data), for user 1 listening to song 2
    71 72.3 74.9 73 72.5 72.9

    Signal B (user data), for user 2 listening to song 2
    60.6 60.2 61 60.7 61 59.3 ...

    Etc.

The series are obviously truncated for this illustration. Again, there are many songs, and not every user listened to every song.

*I am interested in whether I can draw conclusions about how well all song data (signal A) can predict all user response (signal B).*

Ideally, I would like to capture the cross-correlation in one number (one test statistic for each song), so that I may easily quantify whether there is an overall correlation between the two signals.

Using ccf (in R) gives me a value for each lag. For example:

    > print(ccf(x,y))

    Autocorrelations of series ‘X’, by lag

        -6     -5     -4     -3     -2     -1      0      1      2      3      4
    -0.242 -0.090  0.057  0.197  0.466  0.699  0.896  0.436  0.221 -0.018 -0.116

(Are these values the cross-correlation coefficients?)

Also, my data are not stationary. Is there any way (another function?) to test whether signals A and B are correlated across users and songs?

One approach would be to average signal B (take the mean user response) for each song, but because there are a different number of users for each song, working with means might be problematic.

So, my main questions again are:

If I perform a cross-correlation for one user data/song data pair, how do I test for significance? Will R give me a correlation coefficient at each lag, or does it only tell me which lag is significant (but not provide any test statistic)? If the latter is the case, will I need to adjust one series of data (to account for the lag) before running a normal Pearson's correlation?

What test may I use when the data are not stationary?

There are a different number of users for each song. For this reason, I can't simply take the average of all users' data for each song (to correlate the mean user data with the song data) – is that correct? Is there a way to test the correlation between signals A and B for each song (across existing users), or must I try to calculate the correlation for each user/song pair individually?

I hope my intent is clear. Thanks for any insight.


#### Best Answer

> I want to prove that, overall, signal B is correlated to signal A.

If you want to prove that, you could calculate the empirical correlation and estimate its statistical significance under the assumption of i.i.d. observations. However, time series data are notorious for *not* satisfying the i.i.d. assumption; the conditional means and/or variances of time series usually change over time. Hence, you need a model that describes the relationship between A and B and how both series evolve over time (possibly including how the relationship itself evolves). Once you have built a model and validated its assumptions, you can proceed to model-based inference. For example, you can test the model's overall significance, or the significance of particular coefficients or combinations of coefficients. That way you may establish (or fail to establish) a significant relationship between A and B. (You can think of the i.i.d. case as a very simple model in which the means, variances and higher-order moments are constant, and so is the relationship between A and B.)
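To make the "empirical correlation under i.i.d." starting point concrete, here is a minimal sketch in R for a single song/user pair. The data are simulated stand-ins (not the asker's), first differencing is only one assumed way of dealing with non-stationarity, and the names `song`, `user`, `d_song` and `d_user` are hypothetical. It shows that `ccf()` does return a coefficient at every lag, and that the usual approximate 95% bound is valid only if the series behave like white noise.

```r
# Simulated stand-ins for one song/user pair (hypothetical data).
set.seed(1)
n    <- 200
song <- cumsum(rnorm(n))                          # non-stationary "signal A"
user <- c(0, 0, song[1:(n - 2)]) +                # "signal B" follows A by 2 steps
        cumsum(rnorm(n, sd = 0.5))                # plus its own random walk

# Assumption: first differences are close enough to stationary.
d_song <- diff(song)
d_user <- diff(user)

# ccf(x, y) at lag k estimates cor(x[t + k], y[t]).
cc    <- ccf(d_song, d_user, lag.max = 10, plot = FALSE)
coefs <- drop(cc$acf)                             # one coefficient per lag
lags  <- drop(cc$lag)

# Approximate 95% bound, valid only under the white-noise (i.i.d.) assumption.
bound <- qnorm(0.975) / sqrt(cc$n.used)

best      <- which.max(abs(coefs))
best_lag  <- lags[best]                           # negative lag: song leads user
best_coef <- coefs[best]
exceeds   <- abs(coefs) > bound                   # lags outside the bound

print(data.frame(lag = lags, coef = round(coefs, 3), exceeds = exceeds))
```

Because many lags are inspected at once, a few will exceed the bound by chance alone, and any autocorrelation left after differencing makes the bound too narrow. That is exactly why a model whose assumptions can be checked is preferable to reading significance off the raw cross-correlations.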

This may be too general to be directly useful, but it should provide a framework within which to think and to develop the discussion further. Unfortunately, I do not yet understand your problem well enough to suggest a concrete model to work with.
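Purely as an illustration of what model-based inference could look like here, and not as a recommendation for your data, the sketch below pools all song/user pairs into one long table and tests a single regression coefficient. Everything in it is an assumption: the data are simulated, the series are taken to be already differenced and stationary, the relationship is placed at lag 0 with a made-up effect of 0.5, and the names `long`, `d_song` and `d_user` are hypothetical. What it does show is that an unequal number of users per song is not an obstacle in itself, because each pair simply contributes its own rows.

```r
# Hypothetical unbalanced design: 3 songs with 2, 4 and 3 listeners.
set.seed(42)
n              <- 100                      # observations per song/user pair
users_per_song <- c(2, 4, 3)

rows <- list()
for (s in seq_along(users_per_song)) {
  d_song <- rnorm(n)                       # differenced song signal (assumed stationary)
  for (u in seq_len(users_per_song[s])) {
    d_user <- 0.5 * d_song + rnorm(n)      # assumed true effect of 0.5 at lag 0
    rows[[length(rows) + 1]] <- data.frame(
      song   = s,
      user   = u,
      d_song = d_song,
      d_user = d_user
    )
  }
}
long <- do.call(rbind, rows)               # one row per time point per pair

# Pooled regression: one coefficient summarises the A -> B relationship.
fit   <- lm(d_user ~ d_song, data = long)
slope <- coef(summary(fit))["d_song", ]    # estimate, std. error, t value, p-value

print(table(long$song) / n)                # number of users per song
print(round(slope, 4))
```

The t-test on `d_song` is trustworthy only if the residuals are uncorrelated within and across pairs, so the residuals would need checking (for example with `acf(resid(fit))`) before the p-value is taken seriously. User-specific or song-specific effects would call for a richer model than this one.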