In the age of Artificial Intelligence, a lot of innovation is going into building solutions whose output doesn't sound plastic or artificial. While Natural Language Processing (NLP) is primarily focused on consuming natural language text and making sense of it, Natural Language Generation (NLG) is a niche area within NLP that aims to generate human-like text rather than obviously machine-generated output.
What’s NLG:
According to Wikipedia, Natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations.
While there are many ways to perform Natural Language Generation, ranging from simple rule-based text generation to highly advanced deep learning models, here we will explore a simple but effective way of doing NLG with a Markov chain model.
Please note that we will not get into the internals of building a Markov chain; rather, this article focuses on implementing the solution using the Python module markovify.
Description of Markovify:
Markovify is a simple, extensible Markov chain generator. Right now, its main use is for building Markov models of large corpora of text and generating random sentences from that. But, in theory, it could be used for other applications.
Module Installation
pip install markovify
About the Dataset:
This dataset includes the entire corpus of articles published by the ABC website in the given time range. With a volume of 200 articles per day and a good focus on international news, we can be fairly certain that every event of significance has been captured here. The dataset can be downloaded from Kaggle Datasets.
A Little About Markov Chains
Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state, e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first. Read more about how Markov chains work in this interactive article by Victor Powell.
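To make the idea concrete, here is a minimal sketch in Python of the baby example above. The transition probabilities are invented purely for illustration; they are not taken from any real data.

import random

# Toy transition table for the "baby" Markov chain; each row gives
# the probability of hopping from one state to the others.
# The numbers are made up purely for illustration.
transitions = {
    'playing':  {'playing': 0.5, 'eating': 0.2, 'sleeping': 0.2, 'crying': 0.1},
    'eating':   {'playing': 0.3, 'eating': 0.1, 'sleeping': 0.5, 'crying': 0.1},
    'sleeping': {'playing': 0.4, 'eating': 0.3, 'sleeping': 0.2, 'crying': 0.1},
    'crying':   {'playing': 0.1, 'eating': 0.4, 'sleeping': 0.3, 'crying': 0.2},
}

def walk(state, steps):
    # Hop from state to state according to the transition probabilities
    path = [state]
    for _ in range(steps):
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        path.append(state)
    return path

print(walk('playing', 5))  # e.g. ['playing', 'playing', 'eating', 'sleeping', ...]

markovify applies exactly this idea to text: the states are short sequences of words, and the transition probabilities are estimated from how often one sequence follows another in the corpus.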
Loading Required Packages
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import markovify # Markov Chain Generator
Reading Input Text File
inp = pd.read_csv('../input/abcnews-date-text.csv')
inp.head(3)

  publish_date                                     headline_text
0     20030219  aba decides against community broadcasting lic…
1     20030219    act fire witnesses must be aware of defamation
2     20030219   a g calls for infrastructure protection summit
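The model below is fed the headline_text column directly, which works because markovify accepts any iterable of strings as its corpus. If your copy of the data might contain missing values, a small optional cleanup step like this is worth doing first:

# Optional: drop rows with missing headlines and force everything to str
inp = inp.dropna(subset=['headline_text'])
inp['headline_text'] = inp['headline_text'].astype(str)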
Building the text model with Markovify
# Build a Markov chain model of the headlines; state_size=2 means the
# next word is chosen based on the previous two words
text_model = markovify.NewlineText(inp.headline_text, state_size = 2)
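Building the chain over roughly a million headlines takes a little while, so it can be worth persisting the fitted model instead of rebuilding it on every run. markovify supports this via its to_json and from_json methods; here is a minimal sketch (the filename is just an example):

# Serialise the fitted chain to JSON and reload it later
with open('headline_model.json', 'w') as f:
    f.write(text_model.to_json())

with open('headline_model.json') as f:
    text_model = markovify.NewlineText.from_json(f.read())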
Generate Random Text
# Print ten randomly-generated sentences using the built model
for i in range(10):
    print(text_model.make_sentence())

federal treasurer to deliver grim climate prediction
the myth of mum and baby injured after crashing into tree
aussie duo in sync
xenophon calls for more uranium mines
tough penalties fake aboriginal art exhibition
katherine govt considers afghan troop request
one mans plan to fight bushfire at neath
fifa confirms candidates for 2010 start
no white flag on costly hail storm payouts unlikely before christmas
is it really like
Now, this text could become the input for a Twitter bot, a Slack bot, or even a parody blog. And that is the whole point of NLG: generating text that sounds less like a machine and more like something written by a human.
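If the target really is a Twitter bot, markovify's make_short_sentence is handy: it works like make_sentence but rejects anything longer than a character limit. A small sketch (the 140-character limit is just an example):

# Generate a headline that fits in a tweet; returns None if no
# suitable sentence is found within the default number of tries
tweet = text_model.make_short_sentence(140)
if tweet is not None:
    print(tweet)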
References:
- Github Repo
- Kaggle Kernel
- Markov Chains explained Visually
- Input Dataset