NLP | Sequence to Sequence Networks | Part 1 | Processing Text Data

Mohammed AL-Ma'amari
Towards Data Science
Nov 8, 2018

Understanding NLP has many benefits: you can build your own question-answering model and use it in a chatbot, build a translator between your language and English, or build a text summarizer.

In this tutorial series, we will learn how to build a seq2seq network and train it to translate English text to French. You can also use it for other seq2seq tasks.

In this part of the series, we will learn how to process text data so that it can be fed to the seq2seq network.

We will learn two methods of processing text:

  • Character-level processing
  • Word-level processing (using embeddings)

I used a dataset of English → French sentence pairs. You can get it from [Here].

For other languages, you can get datasets using [this] link.

First: Character-level processing

Overview:

The text processing steps are explained in the following pictures:

So, to represent the word ball:

And to represent the sentence hello world!:

I hope you now have some intuition for the steps of processing text data.

Now we will do some coding in Python:

First, let's import numpy:

import numpy as np

Then, load the text file:
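
A minimal sketch of this step, assuming the dataset is a UTF-8 text file with one sentence pair per line. The helper `load_lines` and the file name `demo_pairs.txt` are hypothetical; a tiny stand-in file is written first so that the example runs on its own.

```python
import os
import tempfile


def load_lines(path, encoding="utf-8"):
    """Read a text file and return its non-empty lines."""
    with open(path, encoding=encoding) as f:
        return [line for line in f.read().split("\n") if line]


# Write a tiny stand-in dataset so the sketch is self-contained.
# In practice, `path` would point at the downloaded dataset file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_pairs.txt")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hi.\tSalut !\nRun!\tCours !\n")
    lines = load_lines(demo_path)

print(len(lines), "lines loaded")
```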

After that, split the samples and get the necessary dictionaries:
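
Here is one way this step could look, assuming each line holds an English sentence and its French translation separated by a tab (an assumption about the file layout). The '\t' start marker matches the decoder input example later in this article; using '\n' as an end marker is an extra assumption.

```python
# Two stand-in lines in the assumed "english<TAB>french" layout.
lines = ["Hi.\tSalut !", "Run!\tCours !"]

input_texts, target_texts = [], []
input_chars, target_chars = set(), set()

for line in lines:
    input_text, target_text = line.split("\t")[:2]
    # '\t' marks the start of a target sentence, '\n' marks its end (assumption).
    target_text = "\t" + target_text + "\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Collect every distinct character seen on each side.
    input_chars.update(input_text)
    target_chars.update(target_text)

# Sort so that the character order is reproducible between runs.
input_chars = sorted(input_chars)
target_chars = sorted(target_chars)

print(input_texts)
print(target_texts)
```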

Make the needed dictionaries to convert characters to integers and the opposite:
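
A small sketch of the two lookup dictionaries, built here from the characters of the phrase hello world! instead of the full dataset. The names `char_to_int` and `int_to_char` are illustrative.

```python
# Distinct characters of a toy text, sorted for a stable order.
chars = sorted(set("hello world!"))

# Character -> integer, and the reverse mapping.
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Round trip: text -> integers -> text.
encoded = [char_to_int[ch] for ch in "hello"]
decoded = "".join(int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```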

Compute the length of the longest sample and some other variables:
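
The quantities printed in the output can be computed along these lines. This sketch uses two toy sentence pairs, so its numbers are far smaller than the ones reported for the real dataset, and all variable names are illustrative.

```python
# Toy samples; the target side already carries the '\t' start and
# '\n' end markers (the end marker is an assumption).
input_texts = ["Hi.", "Run!"]
target_texts = ["\tSalut !\n", "\tCours !\n"]

input_chars = sorted(set("".join(input_texts)))
target_chars = sorted(set("".join(target_texts)))

num_samples = len(input_texts)
num_encoder_chars = len(input_chars)   # "Number of E Chars"
num_decoder_chars = len(target_chars)  # "Number of D Chars"
max_encoder_len = max(len(text) for text in input_texts)
max_decoder_len = max(len(text) for text in target_texts)

print("Number of E Samples :", num_samples)
print("Number of D Chars :", num_decoder_chars)
print("Number of E Chars :", num_encoder_chars)
print("The Longest D Sample has", max_decoder_len, "Chars")
print("The Longest E Sample has", max_encoder_len, "Chars")
```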

Output:

Number of E Samples : 160872
Number of D Samples : 160872
Number of D Chars : 115
Number of E Chars : 92
The Longest D Sample has 351 Chars
The Longest E Sample has 286 Chars

E → the input text (will be encoded later)
D → the output text (will be decoded later)

Next, we will one-hot encode the samples character by character.
Ex:

Hi → [[0,0,0,…,1,0,0,0],[0,0,0,…,0,1,0,0]]
where each sample is represented as an array of zeros with (n) rows and (j) columns, and a single 1 in each row that holds a character:
n = number of characters in the longest sample
j = number of characters in our dictionary
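
To make the (n) × (j) layout concrete, here is a short sketch that one-hot encodes a single sample with NumPy. The helper `one_hot_encode` and the toy dictionary are illustrative only.

```python
import numpy as np


def one_hot_encode(text, char_to_int, max_len):
    """Return a (max_len, dictionary size) matrix with one 1 per character."""
    matrix = np.zeros((max_len, len(char_to_int)), dtype="float32")
    for pos, ch in enumerate(text):
        matrix[pos, char_to_int[ch]] = 1.0
    return matrix


# Toy dictionary built from one phrase; n = 8 is the longest sample here.
chars = sorted(set("Hi there"))
char_to_int = {ch: i for i, ch in enumerate(chars)}

encoded = one_hot_encode("Hi", char_to_int, max_len=8)
print(encoded.shape)  # (n, j)
```

Rows beyond the end of the sample stay all zeros, which is how shorter samples get padded up to the longest one.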

We will make three sets of data:
1- Encoder input samples (English sentences)
2- Decoder input samples (French sentences)
3- Target (French sentences)

The target is the same data as the decoder input, but shifted one character ahead of it.
Ex:
Decoder Input = '\tHow are yo'
Target = 'How are you'
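
The three arrays and the one-character shift can be sketched as follows. This uses two toy pairs and illustrative names; the '\n' end marker is an assumption, while the '\t' start marker follows the example above.

```python
import numpy as np

input_texts = ["Hi.", "Run!"]
target_texts = ["\tSalut !\n", "\tCours !\n"]

input_index = {ch: i for i, ch in enumerate(sorted(set("".join(input_texts))))}
target_index = {ch: i for i, ch in enumerate(sorted(set("".join(target_texts))))}

n = len(input_texts)
max_enc = max(len(t) for t in input_texts)
max_dec = max(len(t) for t in target_texts)

encoder_input_data = np.zeros((n, max_enc, len(input_index)), dtype="float32")
decoder_input_data = np.zeros((n, max_dec, len(target_index)), dtype="float32")
target_data = np.zeros((n, max_dec, len(target_index)), dtype="float32")

for i, (inp, tgt) in enumerate(zip(input_texts, target_texts)):
    for t, ch in enumerate(inp):
        encoder_input_data[i, t, input_index[ch]] = 1.0
    for t, ch in enumerate(tgt):
        decoder_input_data[i, t, target_index[ch]] = 1.0
        if t > 0:
            # The target is one step ahead: it skips the start marker.
            target_data[i, t - 1, target_index[ch]] = 1.0

print("Shape of encoder_input_data :", encoder_input_data.shape)
print("Shape of decoder_input_data :", decoder_input_data.shape)
print("Shape of target_data :", target_data.shape)
```

Note that arrays of this kind grow with samples × longest sample × dictionary size, so for a large dataset it may be worth encoding in batches.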

Output:
Shape of encoder_input_data : (160872, 286, 92)
Shape of decoder_input_data : (160872, 351, 115)
Shape of target_data : (160872, 351, 115)

Now, the data is ready to be used by a seq2seq model.

Second: Word-level processing (using embeddings):

Overview:

In this method, we follow the same steps as in the first method, but instead of making a dictionary of characters, we make a dictionary of the words used in the text we want to process, or sometimes we use the 10,000 most frequent words of the text's language.

To make it easy to understand what we are going to do, here are the steps:

  1. Convert the text to lowercase.
  2. Clean the data of digits and punctuation.
  3. Add 'SOS' and 'EOS' to the target data:
     SOS → Start of Sentence
     EOS → End of Sentence
  4. Make dictionaries to convert words to integer indices.
  5. Use an embedding layer to convert each word to a fixed-length vector.
     Word embeddings provide a dense representation of words and their relative meanings.
     To learn more about word embeddings: [1], [2], [3], [4]
  6. Now the data is ready to be used by the seq2seq network.

Load the text data:

Data cleanup:
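
A possible cleanup function covering the first two steps (lowercasing, then removing digits and punctuation). The name `clean_text` is illustrative, and `string.punctuation` only covers ASCII punctuation, so French quotation marks such as « » would need to be added separately.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase the text, then strip digits, punctuation and extra spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)       # remove digits
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    return " ".join(text.split())         # collapse repeated whitespace


cleaned = clean_text("Hello, World! 123")
print(cleaned)
```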

Sample processing:
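
The sketch below illustrates steps 3 and 4 on toy, already-cleaned sentences: adding 'SOS' and 'EOS' to the targets, building word dictionaries, and converting each sentence to a padded row of integers. Reserving index 0 for padding is an assumption, and the helper names are illustrative.

```python
import numpy as np

# Toy samples that are already lowercased and cleaned.
input_texts = ["hi there", "run"]
target_texts = ["salut", "cours vite"]

# Mark the start and end of every target sentence.
target_texts = ["SOS " + text + " EOS" for text in target_texts]


def build_vocab(texts):
    """Map each distinct word to an integer; 0 is kept for padding (assumption)."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: i + 1 for i, word in enumerate(words)}


def to_padded_ids(texts, vocab):
    """Convert sentences to rows of word indices, zero-padded to equal length."""
    max_len = max(len(text.split()) for text in texts)
    ids = np.zeros((len(texts), max_len), dtype="int32")
    for row, text in enumerate(texts):
        for col, word in enumerate(text.split()):
            ids[row, col] = vocab[word]
    return ids


input_vocab = build_vocab(input_texts)
target_vocab = build_vocab(target_texts)
encoder_ids = to_padded_ids(input_texts, input_vocab)
decoder_ids = to_padded_ids(target_texts, target_vocab)

print(encoder_ids)
print(decoder_ids)
```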

Using word embeddings:

I will show only the line where I used the embedding layer; the whole network will be explained in the next part of this tutorial series.

num_words: the number of words in the dictionary we used to convert words to numbers
vec_len: the length of the vector that will represent each word
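
To show what these two parameters control, here is a plain NumPy stand-in for an embedding layer, not the layer used in the network itself. An embedding is essentially a table with num_words rows and vec_len columns, and looking up a word index returns its row. In a real model the table is learned during training; here it is filled with random values, and the sizes are toy values.

```python
import numpy as np

num_words = 6  # dictionary size, including a padding index 0 (toy value)
vec_len = 4    # length of each word vector (toy value)

# Random stand-in for a table that a real embedding layer would learn.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(num_words, vec_len)).astype("float32")

# Two padded sentences expressed as word indices.
word_ids = np.array([[2, 4, 1, 0],
                     [2, 3, 5, 1]])

# Indexing the table with the ids replaces each index with its vector.
embedded = embedding_matrix[word_ids]
print(embedded.shape)  # (sentences, words per sentence, vec_len)
```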

What's next:

In the next part [part 2], we will build the model and train it, then use it to translate English text to French.

References:

All the references for this series will be at the end of the last part.

You can follow me on Twitter @ModMaamari

I am a computer engineer | I love machine learning and data science and spend my time learning new stuff about them.