Solution for Is there an R package to calculate 1st order transition matrix from a frequency table?
is Given Below:
I have a frequency table aggregated from 800 millions of records and am wondering if I can use a package to calculate 1st order transition matrix from the frequency table, which is not symmetric because some state just never happened again. A sample of the frequency table is:
library(data.table)
model.data <- data.table(state1 = c(3, 1, 2, 3), state2 = c(1, 2, 1, 2), Freq = c(1,2,3,4))
model.data looks like this:
state1 | state2 | n |
---|---|---|
3 | 1 | 1 |
1 | 2 | 2 |
2 | 1 | 3 |
3 | 2 | 4 |
Using the package pollster, I can compute the proportion table:
library(pollster)
crosstab(model.data, state1, state2, Freq)
state1 | 1 | 2 | n |
---|---|---|---|
1 | 0 | 100 | 2 |
2 | 100 | 0 | 3 |
3 | 20 | 80 | 5 |
However, the symmetric transition matrix I am looking for is:
state1 | 1 | 2 | 3 | n |
---|---|---|---|---|
1 | 0 | 100 | 0 | 2 |
2 | 100 | 0 | 0 | 3 |
3 | 20 | 80 | 0 | 5 |
That is, I still want to include the state 3 even though no one transitioned to it, and the code should be able to automatically find out 3 needs to be appended with a column of 0s.
I am not sure if the markovchain package with the markovchainFit function is going to handle my 800 million rows of data that I need to transform into a list of millions of sequences, due to memory constraints and slow computing speed.
Does anyone know?
It appeared you might have known about the stats::xtabs
function since, since the result you are asking us to work with appears to be the result of the base::as.data.frame.table
function which converts the “wide” result from a table
call into a “long” data.frame representation of the same data. (But maybe not since you posted the pollster code that adds an additional confusing column.) Here we will be reversing that procedure so we can recover a matrix (which R table
objects inherit from).
Do notice that I’m using your data object, but not using pkg:pollster code, since your tables didn’t appear to be based on that data.table object.
How to get the zero column, … just put in a single zero data element in the state2=3
“column” position. You only need to add one data point in state2 for the entire column, but it obviously needs to be from some state1 value. It can be from any of the state 1 values:
model.data <- data.table(state1 = c(3, 1, 2, 3, 3),
state2 = c(1, 2, 1, 2, 3),
Freq = c(1,2,3,4, 0))
xtabs(Freq~state1+state2, model.data)
#------------
state2
state1 1 2 3
1 0 2 0
2 3 0 0
3 1 4 0
Note added: Just to show that this works in the “pollster” tidyverse environment …
> library(pollster)
> crosstab(model.data, state1, state2, Freq)
# A tibble: 3 x 5
state1 `1` `2` `3` n
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 100 0 2
2 2 100 0 0 3
3 3 20 80 0 5
And further note that the “n” column would need to be dropped if you wanted to make a transition matrix. (I can’t quite figure out what it represents.)
Regarding how to make a transition matrix (if that’s what is needed, then divide the matrix by the rowSums
result, since transition matrices need to have each row sum to unity)
mat <- xtabs(Freq~state1+state2, model.data)
trans_mat <- mat/rowSums(mat)
trans_mat
#-----
state2
state1 1 2 3
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.2 0.8 0.0
Now you can calculate the state at any discrete interval using matrix multiplication: See ?'%*%'
or matrix exponentiation ?expm::expm
Here’s further coding a plots related to matrix operations on transition matrices to generate Markov simulations:
Simple Markov Chain in R (visualization)
There are further statistical operations on Markov sequences delivered in the markovchain
package, but I did not see that it had anything for actually construction a transition matrix from data. I could be wrong about that since I only read the first 5 packages of the vignette. (They seemed to assume everyone would know how to do that, although when I wrote the code for the answer I linked above, I needed to go back to my books for a refresher.)
An option with igraph
model.data %>%
setorder(state1) %>%
graph_from_data_frame() %>%
as_adjacency_matrix(attr = "Freq", sparse = FALSE) %>%
proportions(1) # 1 sets rows as the margin, similar to `prop.table`
gives
1 2 3
1 0.0 1.0 0
2 1.0 0.0 0
3 0.2 0.8 0
Or with base R
> proportions(xtabs(Freq ~ ., model.data), 1)
state2
state1 1 2
1 0.0 1.0
2 1.0 0.0
3 0.2 0.8