Ben Chuanlong Du's Blog

And let it direct your passion with reason.

How Long Does It Take to Observe a Sequence?

There are many interesting while at the same time very tricky problems in statistics. One famous question is that how many steps (expected) does it take to observe a given sequence (e.g. THTH, TTHH), if we flip a balanced coin?

This problem can be solved using (delay) renewal theory or martingales. These two methods are well explained in stochastic processes text books (e.g. Stochastic Processes, S. M. Ross), so I will not explain them here. I implemented an algorithm base on martingales in Mathematica. It can deal with all this kind of problems, no matter how long the pattern to be observed is and no matter which discrete distribution is specified for the sample space. The following are some running results in Mathematica.

sp = {"H", "T"};
pat1 = {"H","H", "T", "T"};
pat2 = {"H", "T", "H","T"};
PatternExpectedTime[pat1, sp]
16
PatternExpectedTime[pat2, sp]
20

Some people might be shock by the above results, because they either think that the result should be the same, or they think that the expected time for patter "HTHT" to occur should be smaller. Notice that pattern "HTHT" has a duplicated structure which is the reason make it harder to occur. To be clearer, we know that a random sequence generated by flipping a coin n times won't have a big chance to have either too many or too few runs. Duplicated structures usually result in too many runs in sequence, e.g. "HTHT" has four runs which is the biggest run we can have in a sequence of length 4, so it would take more time on average for "HTHT" to come out.

Here are more examples and results of the code. You need either Mathematica or CDF player to open the code. For you convenience, the code is also presented below. Notice that the code also contains a function for calculating the probability for a pattern to come out before other patterns. For a more detail description about this interesting problem see my post Which One Is the Best Strategy? and Which One Is the Best Strategy? (continued).

PatternExpectedTime::nprob = "All element of `1` are negative.";
PatternExpectedTime::ndist = "The sum of `1` is greater than 1.";
PatternExpectedTime::sp = "The sample space `1` has duplicated element(s).";
PatternExpectedTime::nmatch = "The length of `1` does not match the length of `2`.";
PatternExpectedTime[pattern_, samspace_, prob_: Null, padj_: True] :=
    (*the algorithm is based on martingales*)
    (*pattern: a list defining the pattern to be observed*)
    (*samspace: a list defining the sample space*)
    (*prob: a list defining probabilites*)
    (*padj: logical, whether to auto adjust argument prob to make it sum to 1*)

Module[{nsample, npattern, i, j, ResultProb = {}, temp, expect, probs},
    nsample = Length[samspace];
    If[Length[DeleteDuplicates[samspace]] < nsample,
        Message[PatternExpectedTime::sp, samspace];
        Return[]
    ];
    If[prob === Null,
    probs = Table[1/nsample, {nsample}],
    If[Length[prob] != nsample,
        Message[PatternExpectedTime::nmatch, prob, samspace];
        Return[]
    ];
    If[Min[prob] < 0,
    Message[PatternExpectedTime::nprob, prob];
    Return[]
    ];
    If[padj,
        probs = prob/Total[prob],
        If[Total[prob] > 1,
          Message[PatternExpectedTime::ndist, prob];
          Return[],
          probs = prob
          ];
        ];
    ];
    (*the following is the core algorithm*)
    npattern = Length[pattern];
    For[i = 1, i <= npattern, i++,
        AppendTo[ResultProb, 
        probs[[Flatten[Position[samspace, pattern[[i]]]][[1]]]]];
    ];
    expect = npattern;
    For[i = 1, i <= npattern, i++,
        temp = 1;
        For[j = i, j <= npattern, j++,
            If[pattern[[j]] == pattern[[j - i + 1]],
                temp = temp/ResultProb[[j]],
                temp = 0;
                Break[]
            ]
        ];
        expect = expect + temp - 1
    ];
    Return[expect]
]

PatternAdditionalExpectedTime[pat_, gpat_, samspace_, prob_: Null, padj_: True] :=
(*calculate the expected time for a pat to occur \
given that gpat already occurs,
it's essentially based on the function PatternExpectedTime[]*)
Module[{npat, ngpat, i, j, count, win = {}, temp, nsample, probs},
    nsample = Length[samspace];
    If[Length[DeleteDuplicates[samspace]] < nsample,
        Message[PatternExpectedTime::sp, samspace];
        Return[]
    ];
    If[prob === Null,
        probs = Table[1/nsample, {nsample}],
        If[Length[prob] != nsample,
            Message[PatternExpectedTime::nmatch, prob, samspace];
            Return[]
        ];
        If[Min[prob] < 0,
            Message[PatternExpectedTime::nprob, prob];
            Return[]
        ];
        If[padj,
            probs = prob/Total[prob],
            If[Total[prob] > 1,
                Message[PatternExpectedTime::ndist, prob];
                Return[],
                probs = prob
            ];
        ];
    ];
    npat = Length[pat];
    ngpat = Length[gpat];
    count = ngpat + 1 - npat;
    If[count >= 1,
        temp = 1;
        For[i = 1, i <= npat, i++,
            temp /= probs[[Flatten[Position[samspace, pat[[i]]]][[1]]]]
        ];
        temp -= 1;
        For[i = 1, i <= count, i++,
            For[j = 1, j <= npat, j++,
                If[gpat[[i + j - 1]] != pat[[j]],
                    AppendTo[win, -1];
                    Goto[next];
                ];
            ];
            AppendTo[win, temp];
            Label[next];
        ];
    ];
    count = Max[1, count + 1];
    For[i = count, i <= ngpat, i++,
        temp = 1;
        For[j = i, j <= ngpat, j++,
            If[gpat[[j]] == pat[[j - i + 1]],
                temp /= 
                probs[[Flatten[Position[samspace, pat[[j - i + 1]]]][[1]]]],
                temp = 0;
                Break[];
            ];
        ];
        AppendTo[win, temp - 1];
    ];
    win = Total[win];
    PatternExpectedTime[pat, samspace, probs, padj] - win - ngpat
]

PatternFirstComeOutProbability::patnum = "Argument `1` requires 2 or more elements.";
PatternFirstComeOutProbability[patterns_, samspace_, prob_: Null, padj_: True] := 
Module[{npat, A = {}, row = {}, b = {}, i, j, nsample, probs},
    (*patterns is a list which contains patterns to be compared*)
    (*dist is list which specifies a discrete distribution*)
    npat = Length[patterns];
    If[npat < 2,
        Message[PatternFirstComeOutProbability::patnum, patterns];
        Return[]
    ];
    nsample = Length[samspace];
    If[Length[DeleteDuplicates[samspace]] < nsample,
        Message[PatternExpectedTime::sp, samspace];
        Return[]
    ];
    If[prob === Null,
        probs = Table[1/nsample, {nsample}],
        If[Length[prob] != nsample,
            Message[PatternExpectedTime::nmatch, prob, samspace];
            Return[]
        ];
        If[Min[prob] < 0,
            Message[PatternExpectedTime::nprob, prob];
            Return[]
        ];
        If[padj,
            probs = prob/Total[prob],
            If[Total[prob] > 1,
                Message[PatternExpectedTime::ndist, prob];
                Return[],
                probs = prob
            ];
        ];
    ];
    A = Table[0, {i, 1, npat}, {j, 1, npat}];
    For[i = 1, i <= npat, i++,
        For[j = 1, j <= npat, j++,
            If[j == i, Continue[]];
                A[[i, j]] = PatternAdditionalExpectedTime[patterns[[i]], patterns[[j]], samspace, probs, padj];
            ];
        ];
        A = Join[Table[1, {i, npat}, {j, 1}], A, 2];
        row = Table[1, {i, npat}];
        PrependTo[row, 0];
        AppendTo[A, row];
        For[i = 1, i <= npat, i++,
            AppendTo[b, 
            PatternExpectedTime[patterns[[i]], samspace, probs, padj]
        ]
    ];
    AppendTo[b, 1];
    Return[LinearSolve[A, b]];
]

Comments