Discovering process models from unlabelled event logs #BPM2009

Diogo Ferreira of Universidade Técnica de Lisboa presented a paper on process discovery based on unlabelled event logs, where each event in the log is identified only by the task, not by the process instance that produced it. Consider that a process instance may be executed via any of several paths through the process model, each producing a different sequence of logged events: although you might know all possible paths through the model, you don't know which one any given instance followed. Also consider that multiple instances will be executing simultaneously, so their events are intermingled in the log.

Taking a probabilistic approach, you can take the event sequence and the source sequence (i.e., you know both the events and the instances that created them) and generate a matrix of probabilities that any given event will follow another within the same source instance: that's some fairly standard math. He then took the event sequence and the matrix (which now represents a priori knowledge about how events interrelate), and performed a fairly magic-looking calculation that estimated the source sequence from that information.
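To make the "standard math" part concrete, here's a minimal Python sketch of that step: count, for each pair of tasks, how often one directly follows the other within the same instance, then normalize the counts into transition probabilities. The function name is my own, and the paper's special start and termination states are omitted here for brevity.

```python
from collections import defaultdict

def estimate_transition_matrix(events, sources):
    """Estimate M[a][b] = P(next task is b | current task is a, same instance).

    events  -- list of task labels, e.g. ['a', 'b', 'a', 'c', ...]
    sources -- list of instance ids aligned with events
    """
    counts = defaultdict(lambda: defaultdict(float))
    last_task = {}  # most recent task seen for each instance
    for task, source in zip(events, sources):
        if source in last_task:
            counts[last_task[source]][task] += 1.0
        last_task[source] = task

    # Normalize each row of counts into probabilities
    matrix = {}
    for a, row in counts.items():
        total = sum(row.values())
        matrix[a] = {b: n / total for b, n in row.items()}
    return matrix
```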

The problem, of course, is that you don't have the magic matrix; you only have the event sequence. So: initialize the matrix to something, use the event sequence and the matrix to estimate the source sequence, then use the event sequence and the estimated source sequence to re-estimate the matrix. Wash, rinse, repeat until the matrix converges. You could initialize the matrix randomly, but that would take a long time to converge (or would converge to a local maximum); instead, Ferreira pulled a rabbit out of his hat by stating that the matrix can be initialized with the transition probabilities observed in the event sequence itself, that is, as if the event sequence were generated from a single source. As the paper states:

Even if x [event sequence] is the result of interleaving a number of sources, their underlying behaviour will be present in M+ [probability matrix] since consistent behaviour will stand out with stronger transition probabilities than the spurious effects of random interleaving. Therefore, M+ is a good initial guess for the estimation of M.
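Building on the estimate_transition_matrix sketch above, here's a rough Python illustration of that iterative loop: initialize the matrix as if the whole log came from a single source (M+), then alternate between assigning events to sources and re-estimating the matrix. The greedy "start a new instance when no open one is a plausible continuation" threshold, the fixed iteration count, and the lack of explicit instance termination are my simplifications to show the shape of the idea, not Ferreira's exact algorithm.

```python
def estimate_sources(events, matrix, min_prob=0.1):
    """Greedily assign each event to the open instance whose last task most
    probably transitions to it; start a new instance if no transition beats
    min_prob. (The threshold rule is a simplification, not the paper's
    exact criterion.)"""
    sources = []
    last_task = {}   # instance id -> last task assigned to it
    next_id = 0
    for task in events:
        best_source, best_p = None, min_prob
        for source, prev in last_task.items():
            p = matrix.get(prev, {}).get(task, 0.0)
            if p > best_p:
                best_source, best_p = source, p
        if best_source is None:          # no plausible continuation: new instance
            best_source, next_id = next_id, next_id + 1
        sources.append(best_source)
        last_task[best_source] = task
    return sources


def discover(events, iterations=10):
    """Initialize M as if the whole log were one source (the M+ trick),
    then alternate source estimation and matrix re-estimation."""
    matrix = estimate_transition_matrix(events, [0] * len(events))  # M+ initial guess
    sources = [0] * len(events)
    for _ in range(iterations):
        sources = estimate_sources(events, matrix)
        matrix = estimate_transition_matrix(events, sources)
    return matrix, sources
```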

Amazingly, this works. Or maybe it's not so amazing, since I suppose there would be no research paper if it didn't work. As the number of sources (instances) increases, the accuracy approaches that of the case where both the event and source sequences are known; as the number of overlapping sources increases, the accuracy drops to about 60% by the time you reach 10 overlapping instances, then flattens out.

There are a number of use cases for this: preprocessing for other process mining algorithms, or as a labeling mechanism to find the source instance when it is unknown. Or just if you want to show off some cool math at parties.
