The language processing blocks are summarised below.

Language detector

A predictive language detector is used to predict the language in which the text is written. This helps in the subsequent cleaning and transformation steps.

Text cleaner/lemmatiser

Depending on which language resources are available, the text is directed to one of the Text cleaner, Basic processor, or Advanced processor blocks. Each block is summarised below.

If the document being processed is in a language for which no model or lookup data is available, the processor directs the document to the Text cleaner. The Recommendation engine code is distributed with a set of stop-words for 58 languages. The Text cleaner converts the words to lower case, removes URLs, email addresses, punctuation, etc., then removes all the stop-words from the document (if stop-words for the language are available), and finally returns the 'cleaned' text document.

spaCy distributes lookup data tables, which are used with the main spaCy framework to remove stop-words and to perform other text cleaning functions for each language. All of these lookup data tables are distributed under the MIT licence. The lookup data tables provide a mapping from raw words to their origin words (or lemmas). Lemmatisation helps to reduce the total number of unique words, because words that have a similar context or meaning are grouped together. For example, the words 'trouble', 'troubling', 'troubled', and 'troubles' are all reduced to the same word, 'trouble'. If these words were not replaced with their lemma, the subsequent machine learning algorithm would treat them as entirely different features or variables with no connection to each other.

spaCy has full model support for some languages. These models help to identify whether each word is URL-like, email address-like, number-like, punctuation, a stop-word, etc. This functionality helps to clean the text with greater accuracy.
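As a rough illustration of the Text cleaner and the lookup-based lemmatisation described above, here is a minimal Python sketch that uses only the standard library. It is not the Recommendation engine's actual code: the names `STOP_WORDS`, `LEMMA_LOOKUP`, `clean_text`, `lemmatise`, and `choose_block` are hypothetical, the tiny word lists stand in for the real stop-word sets and spaCy lookup tables, and the routing rule in `choose_block` is an assumption about how the three blocks might map to the available resources.

```python
import re
import string

# Hypothetical, tiny stand-ins for the real per-language resources.
STOP_WORDS = {"en": {"the", "a", "an", "is", "are", "at", "and", "or", "to", "me"}}
LEMMA_LOOKUP = {
    "en": {"troubling": "trouble", "troubled": "trouble", "troubles": "trouble"}
}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text, lang):
    """Lower-case, strip URLs/emails/punctuation, then drop stop-words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    stop_words = STOP_WORDS.get(lang)
    if stop_words:  # skip this step when no stop-words exist for the language
        tokens = [tok for tok in tokens if tok not in stop_words]
    return " ".join(tokens)


def lemmatise(text, lang):
    """Replace each word with its lemma when the lookup table has one."""
    table = LEMMA_LOOKUP.get(lang, {})
    return " ".join(table.get(tok, tok) for tok in text.split())


def choose_block(has_lookups, has_model):
    """Assumed routing: full model, else lookup tables, else plain cleaning."""
    if has_model:
        return "advanced_processor"
    if has_lookups:
        return "basic_processor"
    return "text_cleaner"


if __name__ == "__main__":
    raw = "Email me at bob@example.com or visit https://example.com NOW!"
    print(clean_text(raw, "en"))
    print(lemmatise("troubled troubles troubling trouble", "en"))
    print(choose_block(has_lookups=False, has_model=False))
```

Because the four 'trouble' variants collapse to a single token, a downstream model would see one feature instead of four, which is the vocabulary reduction that lemmatisation is meant to provide.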