Research on Statistical Extraction and Visualization of Topics in the Qur’an Corpus

Unlike most books, the Holy Qur’an is not structured in a linear format. There is no consistent sequence underlying the ordering of the chapters, and even within a single chapter, many distinct topics may be addressed with little or no transition in between.

A concept that is discussed briefly within a few verses is often readdressed later in a different passage, many times in a different chapter entirely. This seemingly scattered arrangement, a by-product of the sporadic nature of the revelation and compilation process, serves as a reminder that the Holy Qur’an is meant to be a book of guidance and a reference, rather than a sequenced story or ordered instruction manual.

As a result, the reader of the Holy Qur’an is presented with an assortment of topics within small passages, and the entire text must be read in order to consider all verses that deal with a single topic.

When there is too much data, it can be difficult to obtain a “big picture” understanding of what the data represents. This is true in many scenarios, but particularly true when trying to summarize general ideas from large collections of text documents.

Although each individual document may have its own merit, a general summary of the entire collection sometimes provides greater insight into the underlying message.

However, extracting meaningful summaries of text collections can be difficult, especially for massive datasets. With the recent rise of machine learning, this task has become much more feasible, since trends and patterns in a corpus of text can now be identified algorithmically.

In this research, Maysum H. Panju applied unsupervised learning algorithms to the Arabic text of the Holy Qur’an.

In unsupervised learning, the machine is provided only with raw data and is expected to identify patterns on its own.

One application of unsupervised learning is to uncover patterns and similarities within a collection of text documents, without external influence or annotation. Statistical methods that are unbiased by human tendencies may cause hidden patterns to emerge.

Applying statistical and computational methods to the Qur’an, free of human intervention, has the potential to reveal intrinsic structure in the ancient text that manual investigation might not uncover.

In particular, two main approaches were taken:

1. Topic modeling was applied to the Holy Qur’an in order to discover what topics exist within the text, as dictated by the words of the book itself and not influenced by any human annotator.

2. A data visualization technique allowed the verses of the Holy Qur’an to be displayed in a unique representation that demonstrates their relationship with each other based on content, rather than by chapter.
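As a rough illustration of the first approach, the sketch below runs a small topic model on a toy corpus. It is only a sketch under stated assumptions: this summary does not say which topic modeling algorithm the study used, so non-negative matrix factorization (NMF) with multiplicative updates is assumed here, and the six short "documents" are invented placeholders, not verses or data from the research. The helper names (build_matrix, nmf, dominant_topics, top_words) are hypothetical.

```python
import random
from collections import Counter

# Invented placeholder documents; NOT verses or data from the study.
DOCS = [
    "mercy forgiveness mercy prayer",
    "prayer mercy forgiveness",
    "forgiveness prayer mercy mercy",
    "garden river fruit garden",
    "river garden fruit",
    "fruit river garden river",
]


def build_matrix(docs):
    """Return (vocabulary, document-term count matrix)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([float(counts[word]) for word in vocab])
    return vocab, rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def nmf(v, k, iters=500, seed=0, eps=1e-9):
    """Factor v (docs x words) into w (docs x k) and h (k x words).

    Uses the standard multiplicative update rules for the squared error
    objective; all entries stay non-negative throughout.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def dominant_topics(w):
    """Index of the highest-weighted topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


def top_words(h, vocab, n=3):
    """The n highest-weighted vocabulary words for each topic."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


vocab, v = build_matrix(DOCS)
w, h = nmf(v, k=2)
labels = dominant_topics(w)
topics = top_words(h, vocab)
```

Here labels holds one topic index per document and topics holds the most heavily weighted words per topic. No annotation is supplied at any point; the grouping comes only from which words occur together.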

Finally, these two approaches were combined to produce a striking graphical presentation of the verses of the Qur’an organized by topical structure.
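The combination of the two approaches can be sketched as follows: each verse is represented by its topic weights, t-SNE projects those vectors into two dimensions, and each point is coloured by its dominant topic. This is a hedged sketch, not the study's pipeline. It assumes scikit-learn's TSNE and NumPy are available, the topic weights are synthetic, and the perplexity value, the plot_embedding helper and the output file name are all illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic document-topic weights (assumption): 30 stand-in "verses",
# 10 per topic, each dominated by one of 3 topics plus a little noise.
doc_topic = np.vstack([
    center + 0.05 * rng.random((10, 3)) for center in np.eye(3)
])

# The dominant topic of each stand-in verse, used only for colouring.
labels = doc_topic.argmax(axis=1)

# Project the topic vectors to 2-D. Perplexity must be smaller than the
# number of samples; 5 is an arbitrary choice for this tiny example.
embedding = TSNE(
    n_components=2, perplexity=5, init="random", random_state=0
).fit_transform(doc_topic)


def plot_embedding(points, point_labels, path="tsne_topics.png"):
    """Save a scatter plot with one colour per dominant topic."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=point_labels, cmap="tab10", s=20)
    ax.set_title("t-SNE of synthetic topic vectors, coloured by topic")
    fig.savefig(path)
    plt.close(fig)
```

Calling plot_embedding(embedding, labels) writes the figure to disk. Because t-SNE preserves local neighbourhoods rather than global distances, the positions of clusters relative to each other should be read with caution.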


Topic modelling on display: a t-SNE plot of the Holy Qur’an with verses coloured by topic.


The underlying topics of the Holy Qur’an were effectively extracted and identified, and verses were meaningfully clustered in a rich visual representation of the corpus. Based on the generally successful results of this research, the text of the Holy Qur’an was shown to be a suitable subject for unsupervised machine learning techniques.

For the complete data analysis and results, please see the paper referenced below.


Maysum H. Panju – Statistical Extraction and Visualization of Topics in the Qur’an Corpus

