import sys
import re
import math
import urllib.request
import numpy as np
import lxml  # parser backend used by BeautifulSoup below
import nltk
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from bs4.element import Comment
from collections import defaultdict, Counter

STP_SET_ENG_NLTK = set(stopwords.words("english"))
F_stopwords = set(stopwords.words("finnish"))
url_unused_words = ['','https','www','com','-','php','pk','fi','https:','http','http:','html','htm']
english_stop_words = list(STP_SET_ENG_NLTK)
finnish_stop_words = list(F_stopwords)
combine_stopwords = english_stop_words + finnish_stop_words
def Scrapper1(element):
if element.parent.name in ['html','style', 'script']:
return False
if isinstance(element, Comment):
return False
return True
def Scrapper2(body): # text_from_html: extract the visible text from raw HTML
soup = BeautifulSoup(body, 'lxml')
texts = soup.findAll(text=True)
visible_texts = filter(Scrapper1,texts)
return u" ".join(t.strip() for t in visible_texts)
#raw =Scrapper2(html)#text
def Scrapper3(text):
lines = (line.strip() for line in text.splitlines()) # break into lines and remove leading and trailing space on each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))# break multi-headlines into a line each
return u'\n'.join(chunk for chunk in chunks if chunk)# drop blank lines
def Scrapper_title_4(urls):
req = urllib.request.Request(urls, headers={'User-Agent' : "Magic Browser"})
con = urllib.request.urlopen(req)
html= con.read()
title=[]
soup = BeautifulSoup(html, 'lxml')
title.append(soup.title.string)
return(title,urls)
def Web(urls):
req = urllib.request.Request(urls, headers={'User-Agent' : "Magic Browser"})
con = urllib.request.urlopen(req)
html= con.read()
soup = BeautifulSoup(html, 'lxml') #keywordregex = re.compile('')
raw =Scrapper2(html)
clean_text=Scrapper3(raw)
return(clean_text,soup)
# Detect language and stopwords list
def _calculate_languages_ratios(text):
languages_ratios = {}
tokens = wordpunct_tokenize(text)
words = [word.lower() for word in tokens]
for language in stopwords.fileids():
stopwords_set = set(stopwords.words(language))
words_set = set(words)
common_elements = words_set.intersection(stopwords_set)
languages_ratios[language] = len(common_elements) # language "score"
return languages_ratios
def detect_language(text):
ratios = _calculate_languages_ratios(text)
most_rated_language = max(ratios, key=ratios.get)
stop_words_for_language = set(stopwords.words(most_rated_language))
return most_rated_language,stop_words_for_language
def extract_stop_words(detected_language):
stop_words =[]
language_name = detected_language[0]
for i in detected_language[1]: # iterate only the stop-word set, not the characters of the language name
stop_words.append(i)
return (language_name,stop_words)
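# Example usage (illustrative, not part of the original pipeline; the sentence is made up):
#   detected = detect_language("tämä on suomenkielinen esimerkki ja se sisältää muutamia sanoja")
#   lang, stop_words = extract_stop_words(detected)
#   # lang is expected to be 'finnish', since that stop-word list has the largest overlap with the tokens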
#Preprocess
def Text_Clean(Text,stopwords):
clean_text =[]
# strip punctuation, quote marks and control characters from each token before filtering
punct_table = str.maketrans('', '', '.‘"\'?,-:!@)(#%/\\~’”;–…“ \n')
filter_text = [x.lower().strip().translate(punct_table).strip() for x in Text.split()]
for word in filter_text:
[clean_text.append(x)for x in word.split() if x not in stopwords and len(x)>1 and x.isalpha()]
return(clean_text)
# Features Formations
def Divide_Url(url):
from urllib.parse import urlparse
host=[]
obj=urlparse(url)
name =(obj.hostname)
for x in name.split('.'):
if x.lower() not in url_unused_words:
host.append(x)
return(host)
def Divide_URL_HOST_QUERY (URL):
path=[]
host =Divide_Url(URL)
for x in URL.split('/'):
for i in x.split('.'):
for d in i.split('-'):
if d.lower() not in url_unused_words and d.lower() not in host:
path.append(d.lower())
host_dic = COUNTER_DICT(host)
path_dic = COUNTER_DICT(path)
return(host_dic,path_dic)
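# Example call (illustrative, not part of the original script; the URL is made up):
#   host_dic, path_dic = Divide_URL_HOST_QUERY("https://www.example.com/finnish-sauna/booking.html")
#   # host_dic keeps only the host token, e.g. {'example': 1.0}; 'www' and 'com' are in url_unused_words
#   # path_dic holds the remaining path tokens ('finnish', 'sauna', 'booking') with their relative frequencies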
def get_text(soup,h):
text=[]
text2 =[]
text_dic ={}
for w in soup.find_all(h):
h_text = w.text.strip()
h_text =h_text.replace(':','') #change made here
h_text =h_text.replace(',','')
h_text =h_text.replace('|','')
h_text =(h_text.lower())
#change made here
for x in h_text.split('-'):
text.append(x)
if len(text)!=0:
for x in text:
word=[n.strip() for n in x.split(',')]
for x in word:
words=[i.strip() for i in x.split() ]
for x in words:
text2.append(x)
text_dic = COUNTER_DICT(text2)
return(text_dic)
else:
return(text_dic)
def CHEK_NULL(word,dic):
f =0
if len(dic)>=1:
f = dic.get(word)
else:
f =0
if f is None:
f =0
return (f)
def Extract_headerAnchorTitle(soup):
h1_d= get_text(soup,'h1')
h2_d= get_text(soup,'h2')
h3_d=get_text(soup,'h3')
h4_d= get_text(soup,'h4')
h5_d= get_text(soup,'h5')
h6_d= get_text(soup,'h6')
a_d= get_text(soup,'a') # anchor (<a>) tag text
title_d= get_text(soup,'title') # title tag text
return(h1_d,h2_d,h3_d,h4_d,h5_d,h6_d,a_d,title_d)
# Manual Score each Feature
def GET_SCORE_EACH_FEATURE(word,h1_dic, h2_dic,h3_dic,h4_dic,h5_dic,h6_dic,A_dic,title_dic,URL_H_dic,URL_Q_dic):
f1 = CHEK_NULL(word,h1_dic)
f2 = CHEK_NULL(word,h2_dic)
f3 = CHEK_NULL(word,h3_dic)
f4 = CHEK_NULL(word,h4_dic)
f5 = CHEK_NULL(word,h5_dic)
f6 = CHEK_NULL(word,h6_dic)
f7 = CHEK_NULL(word,A_dic)
f8 = CHEK_NULL(word,title_dic)
f9 = CHEK_NULL(word,URL_H_dic)
f10 = CHEK_NULL(word,URL_Q_dic)
return (f1,f2,f3,f4,f5,f6,f7,f8,f9,f10)
def COUNTER_DICT(list_words):
score_dic ={}
if len (list_words)>=1:
list_words = [x for x in list_words if x not in combine_stopwords and len(str(x))>1 and str(x).isalpha() ]
word_count_dict ={}
unique_list =[]
[unique_list.append(x) for x in list_words if x not in unique_list]
lngth_list = len(unique_list)
counter_list = Counter(list_words)
word_fr_dic ={}
for word,fr in counter_list.most_common():
word_fr_dic[word]= fr
for word in unique_list:
fr = word_fr_dic.get(word)
fr_word = fr/lngth_list
score_dic[word]= fr_word
return (score_dic)
else:
return (score_dic) # empty dict when no valid words remain
def WebRank(URL):
Text,HTML = Web(URL)
detected_language = detect_language(Text)
name,stop_words =extract_stop_words(detected_language)
candidate_list = Text_Clean(Text,stop_words)
candidate_dic= COUNTER_DICT(candidate_list)
unique_candidate_list =[]
[unique_candidate_list.append(x) for x in candidate_list if x not in unique_candidate_list if x not in STP_SET_ENG_NLTK and x not in stop_words and len(x)>1 and x.isalpha()]
#Features
URL_H_dic,URL_Q_dic = Divide_URL_HOST_QUERY(URL)
h1_dic, h2_dic,h3_dic,h4_dic,h5_dic,h6_dic,A_dic,title_dic = Extract_headerAnchorTitle(HTML)
# Column headers
string="Word,Relative Frequency %,H1%,H2%,H3%,H4%,H5%,H6%,Anchor%,Title%,Url-Host,Url-Query,GT,web-id";
for word in unique_candidate_list:
try:
fr = candidate_dic.get(word)
if fr is None or not fr:
fr =0
f1,f2,f3,f4,f5,f6,f7,f8,f9,f10 = GET_SCORE_EACH_FEATURE(word,h1_dic, h2_dic,h3_dic,h4_dic,h5_dic,h6_dic,A_dic,title_dic,URL_H_dic,URL_Q_dic)
f12 = 0
f11 =0
string+="\n"+word+",";
string+=str(fr)+",";
string+=str(f1)+",";
string+=str(f2)+",";
string+=str(f3)+",";
string+=str(f4)+",";
string+=str(f5)+",";
string+=str(f6)+",";
string+=str(f7)+",";
string+=str(f8)+",";
string+=str(f9)+",";
string+=str(f10)+",";
string+=str(f11)+",";
string+=str(f12);
except:
continue
return (string,Text)
if __name__ == "__main__":
URL ="http://bbc.com"
word__plus_featuresScore , Text_webpage = WebRank(URL)
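# Example addition (not in the original script; the output file name is hypothetical):
# the string returned by WebRank() already follows the CSV header defined above,
# so it can be written straight to a file for the later dataset-generation step.
with open("bbc_features.csv", "w", encoding="utf-8") as out_file:
    out_file.write(word__plus_featuresScore)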
(2) Features Testing and Training Section
(1) Common.py
import math;
def readData(filename):
from numpy import genfromtxt
import numpy as np
data = genfromtxt(filename, delimiter=' ')
labels=[];
webpageIndex=[];
features=[];
if(type(data[0]) is np.float64):
features.append(data[:-2]);
labels.append(data[-2]);
webpageIndex.append(data[-1]);
else:
for i in range(0,len(data)):
features.append(data[i][:-2]);
labels.append(data[i][-2]);
webpageIndex.append(data[i][-1]);
return {
"features":features,
"labels":labels,
"webpageIndices":webpageIndex
};
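# Worked example (values made up for illustration): a row of the space-separated input file such as
#   3 1 0 0 0 0 0 0 1 1 0 120 1 7
# holds the 12 feature values (TF, H1..H6, Anchor, Title, Url-Host, Url-Query, Text Length),
# followed by the ground-truth label (1) and the webpage id (7), so readData() returns
# {"features": [[3.,1.,0.,0.,0.,0.,0.,0.,1.,1.,0.,120.]], "labels": [1.0], "webpageIndices": [7.0]}.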
def printStatistics(predicted,labels,indices):
import numpy as np
indicesSet=set(indices);
resultString="";
oldIndex=-1;
keywordIndices="";
binaryValues="";
for i in range(0,len(indices)):
binaryValues=binaryValues+str(math.floor(predicted[i]))+"\n";
if(oldIndex!=indices[i]): # webpage id is indices[i]
if(oldIndex!=-1):
resultString=resultString+str(math.floor(oldIndex))+" "+keywordIndices+"\n";
if(predicted[i]==1):
if(keywordIndices!=""):
keywordIndices+=" ";
keywordIndices+=str(i);
if(oldIndex!=-1):
if(predicted[i]==1):
keywordIndices=str(i);
else:
keywordIndices="";
oldIndex=indices[i];
else:
if(predicted[i]==1):
if(keywordIndices!=""):
keywordIndices+=" ";
keywordIndices+=str(i);
resultString=resultString+str(math.floor(oldIndex))+" "+keywordIndices+"\n";
return [resultString,binaryValues];
def save(model, filename):
import joblib # sklearn.externals.joblib is deprecated/removed in newer scikit-learn; use joblib directly
joblib.dump(model, filename)
def load(filename):
import joblib
return joblib.load(filename)
def getHighestProbabilities(scores,indices,top):
import numpy as np
indicesSet=set(indices);
predicted=np.zeros(len(scores));
for val in indicesSet:
scoresInSet=[];
minIndex=len(indices);
for i in range(0,len(indices)):
if(indices[i]==val):
scoresInSet.append(scores[i]);
minIndex=min(minIndex,i);
ids=np.argsort(scoresInSet);
marksInSet=np.zeros(len(scoresInSet));
adjustedTop=min(top,len(scoresInSet));
for i in range(0,len(marksInSet)):
if(i>=len(marksInSet)-adjustedTop):
marksInSet[ids[i]]=1;
for i in range(0,len(indices)):
if(indices[i]==val):
if(marksInSet[i-minIndex]==1):
predicted[i]=1;
return predicted;
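# Toy example (numbers made up): with scores=[0.2, 0.9, 0.4, 0.7], indices=[5, 5, 5, 6]
# and top=1, getHighestProbabilities() keeps the best-scoring word of webpage 5 (the 0.9
# entry) and the only word of webpage 6, i.e. it returns [0., 1., 0., 1.].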
(2) Max_Score.py
# Predicts the top k(10) keywords
# 3.7.2019: RM - Implemented
def scoreEachWebpageKeywords(features,indices,top):
import numpy as np
predicted = np.zeros(len(features));
scores=np.zeros(len(features));
for i in range (0,len(features)):
# Himat's DRank scoring method
tfscore=0.5*features[i][0]
if(features[i][11]>50):
tfscore=0.2*features[i][0]
scores[i]= min(1,features[i][1]) *6
scores[i]+=min(1,features[i][2]) *5
scores[i]+=min(1,features[i][3]) *3
scores[i]+=min(1,features[i][4]) *2
scores[i]+=min(1,features[i][5]) *2
scores[i]+=min(1,features[i][6]) *2
#anchor
scores[i]+=min(1,features[i][7]) *1
#title
scores[i]+=min(1,features[i][8]) *6;
#url Host
scores[i]+=min(1,features[i][9]) *5;
#url Query
scores[i]+=min(1,features[i][10])*4;
scores[i]+= tfscore *1;
predicted=common.getHighestProbabilities(scores,indices,top);
return predicted;
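# Worked example (feature values made up): a word with TF=4 that occurs in H1, the title
# and the URL host of a page with 120 words (features[i][11]>50, so tfscore=0.2*4=0.8)
# scores 1*6 (H1) + 1*6 (title) + 1*5 (URL host) + 0.8 (TF) = 17.8 under the weights above.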
import common
import sys
# PARAMETERS
fold=1;
dataset="mopsi_services";#guardian,macworld,mopsi_services
top=10;
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
# PREDICTING
predicted=scoreEachWebpageKeywords(data["features"],data["webpageIndices"],top)
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(predicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(3) KNN Training and Testing
(3.1) KNN Training
# Trains a KNN model
# Stores it in a file; the number of neighbors is a parameter
import common
import sys
from sklearn.neighbors import KNeighborsClassifier
# PARAMETERS
inFile="../training.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/knn.joblib";
k=2; #default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
k=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA READING
data=common.readData(inFile);
# TRAINING
model = KNeighborsClassifier(n_neighbors=k)
model.fit(data["features"], data["labels"])
# SAVING MODEL
common.save(model,outFile);
print ("Model saved at %s" % (outFile));
(3.2) KNN Testing
# Reads a KNN model from a file and performs prediction
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
# PARAMETERS
inFile="../testing.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/knn.joblib";
top=10;#default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
model=common.load(outFile);
# PREDICTING
predicted=model.predict(data["features"])
probabilities=model.predict_proba(data["features"])
import numpy as np
scores=np.zeros(len(probabilities));
for i in range (0,len(probabilities)):
scores[i]=probabilities[i][1];
newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(4) Bayes Training and Testing
(4.1) Bayes Training
# Trains a Bayes model and stores it in a file
# Bayes Train
import common
import sys
from sklearn.naive_bayes import GaussianNB
# PARAMETERS
output="../output/";
input="../input/";
inFile="train.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/bayes.joblib";
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA READING
data=common.readData(inFile);
# TRAINING
model = GaussianNB()
model.fit(data["features"][:-1], data["labels"][:-1])
# # SAVING MODEL
common.save(model,outFile);
print ("Model saved at %s" % (outFile));
(4.2) Bayes Testing
# Reads a Bayes model from a file and performs prediction
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
# PARAMETERS
inFile="../testing.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/bayes.joblib";
top=10;#default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
model=common.load(outFile);
# PREDICTING
predicted=model.predict(data["features"])
probabilities=model.predict_proba(data["features"])
import numpy as np
scores=np.zeros(len(probabilities));
for i in range (0,len(probabilities)):
scores[i]=probabilities[i][1];
newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(5) MLP Training and Testing
(5.1) MLP Training
#MLP train and Test
# Trains a MLP model and stores it in a file
import common
import sys
from sklearn.neural_network import MLPClassifier
# PARAMETERS
output="../output/";
input="../input/";
inFile="train.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/mlp.joblib";
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
L1=15;
L2=15;
# DATA READING
data=common.readData(inFile);
# TRAINING
model = MLPClassifier(solver='lbfgs', alpha=1e-5,
hidden_layer_sizes=(L1, L2), random_state=1)
model.fit(data["features"][:-1], data["labels"][:-1])
# SAVING MODEL
common.save(model,outFile);
print ("Model saved at %s" % (outFile));
(5.2) MLP Testing
# Reads a MLP model from a file and performs prediction
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
# PARAMETERS
inFile="../testing.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/mlp.joblib";
top=10;#default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
model=common.load(outFile);
# PREDICTING
predicted=model.predict(data["features"])
probabilities=model.predict_proba(data["features"])
import numpy as np
scores=np.zeros(len(probabilities));
for i in range (0,len(probabilities)):
scores[i]=probabilities[i][1];
newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(6) SVM
(6.1) SVM Training
# Trains an SVM model and stores it in a file
# grid optimization of parameters can be done
# 3.7.2019: RM - Implemented
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
# worry about this later
def getOptimizedParameters(features, labels):
import numpy as np
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=cv)
grid.fit(features[:-1], labels[:-1])
return grid;
# leave it at 0 for now, or study SVM parameter optimization
optimize=0;
# PARAMETERS
inFile="../training.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/svm.joblib";
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA READING
data=common.readData(inFile);
# OPTIMIZATION
if(optimize==1):
print ("Optimizing parameters");
parameters=getOptimizedParameters(data["features"],data["labels"]);
C=parameters.best_params_["C"];
gamma=parameters.best_params_["gamma"];
score=parameters.best_score_;
print("The best parameters are %s with a score of %0.2f"
% (parameters.best_params_, parameters.best_score_))
else:
# defaults
C= 10.0;
gamma=1000.0;
# TRAINING
model = svm.SVC(gamma=gamma, C=C, probability=True)
model.fit(data["features"], data["labels"])
# SAVING MODEL
common.save(model,outFile);
print ("Model saved at %s" % (outFile));
(6.2) SVM Testing
# Reads an SVM model from a file and performs prediction
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
# PARAMETERS
inFile="../testing.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/svm.joblib";
top=10;#default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
model=common.load(outFile);
# PREDICTING
predicted=model.predict(data["features"])
probabilities=model.predict_proba(data["features"])
import numpy as np
scores=np.zeros(len(probabilities));
for i in range (0,len(probabilities)):
scores[i]=probabilities[i][1];
newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(7) Decision Tree (DT)
(7.1) DT Training
# Trains a Decision Tree model and stores it in a file
# 15.11.2018: RM - Implemented
# 3.12.2019: RM - Updated for Keywords
import common
import sys
import pydotplus
from sklearn import tree
from sklearn.tree import export_graphviz
# PARAMETERS
output="../output/";
input="../input/";
inFile="train.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/dtree.joblib";
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA READING
data=common.readData(inFile);
# TRAINING
model = tree.DecisionTreeClassifier()
model.fit(data["features"][:-1], data["labels"][:-1])
dot_data = export_graphviz(model,
out_file=None,
filled=True,
rounded=True)
pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.write_png('original_tree.png')
pydot_graph.set_size('"5,5!"')
pydot_graph.write_png('resized_tree.png')
# SAVING MODEL
common.save(model,outFile);
print ("Model saved at %s" % (outFile));
(7.2) DT Testing
# Reads a Decision Tree model from a file and performs prediction
import common
import sys
from sklearn import svm
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
# PARAMETERS
output="../output/";
input="../input/";
inFile="test.txt";
outFile="/home/tko/himat/web-docs/machine_learning/classification/models/dtree.joblib";
top=10;#default
if(len(sys.argv)>1):
dataset=sys.argv[1];
if(len(sys.argv)>2):
fold=int(sys.argv[2]);
if(len(sys.argv)>3):
top=int(sys.argv[3]);
inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
# DATA + MODEL READING
data=common.readData(inFile);
model=common.load(outFile);
# PREDICTING
predicted=model.predict(data["features"])
probabilities=model.predict_proba(data["features"])
import numpy as np
scores=np.zeros(len(probabilities));
for i in range (0,len(probabilities)):
scores[i]=probabilities[i][1];
newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);
# OUTPUT STATISTICS
[resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(resultString);
f.close();
outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
f= open(outFile,"w+");
f.write(binaryValues);
f.close();
(8) Generate Keywords from training and testing files
#php starts
$datasets=["guardian","macworld","mopsi_services"];//guardian,mopsi_services,macworld
$tops=[10,10,5];//guardian,mopsi_services,macworld
$outputDir = "/home/tko/himat/web-docs/machine_learning/classification/output";
$inputDir = "/home/tko/himat/web-docs/machine_learning/txt_files_datasets";
$classifiersDir = "/home/tko/himat/web-docs/machine_learning/classification/classifiers/";
$classificationMethod="max_scoring.py";// means Drank
$classificationMethod="knn";
$neighbors=12;
$classificationMethod="dtree";
/*$classificationMethod="bayes";
$classificationMethod="mlp";
$classificationMethod="svm";*/
for($d=0;$d< count($datasets);$d++){
$dataset=$datasets[$d];
$allResults=[];
for($k=1;$k<=5;$k++){
if($classificationMethod=="knn"){
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k." ".$neighbors);
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
}else if($classificationMethod=="dtree"){
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
}else if($classificationMethod=="bayes"){
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
}else if($classificationMethod=="mlp"){
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
}else if($classificationMethod=="svm"){
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
}else{ //drank
$output = shell_exec('python3 '.$classifiersDir.$classificationMethod." ".$dataset." ".$k." ".$tops[$d]);
}
$results=readResults($outputDir."/".$dataset."/testing_".$k.".txt");
$keywords=readKeywords($inputDir."/".$dataset."/testing_kw_".$k.".txt"); // reads keywords from separate file
for($i=0;$i< count($results);$i++){
if($i>0){
$str.="\n";
}
for($j=0;$j< count($results[$i]);$j++){
if($j>0){
$str.=" ";
}
$str.=$results[$i][$j];
}
}
return $str;
}
#PHP ends
(9) Database setup
# PHP program
# datasetup.py (generates the training and testing data as txt files in txt_files_datasets)
PHP starts
// here we prepare training and testing data from raw information
// 3.7.2019 : RM - Implemented
// 28.10.2019: RM - updating to output keywords in separate files
/* input file format
Word,TF,H1,H2,H3,H4,H5,H6,Anchor,Title,URL-1,Url-2,Text Length,GT,webpag_id
0-Word: the candidate word extracted from the text of the webpage
1-TF: term frequency, i.e. how many times the word appears in the text
2:7-H1-H6: header tags (whether the word appears in each of them)
8-Anchor: text appearing inside anchor tags
9-Title: the title tag
10-Url-1: host part of the URL (main part of the URL)
11-Url-2: query/path part of the URL after the /
12-Text Length: total number of words inside the webpage
13-GT: ground-truth label (whether the word matches the ground-truth keywords)
14-Webpage_id: unique id for each webpage
*/
// file is big, need extra memory
ini_set('memory_limit', '15192M');
// read input
$inputFileName="csv_files/mopsi_services_312.csv";//mopsi_services_414.csv,macworld_220.csv,guardian_412.csv
$outputDirectory="mopsi_services";//mopsi_services,macworld,guardian
$myfile = fopen($inputFileName, "r") or die("Unable to open file!");
$contents = fread($myfile,filesize($inputFileName));
fclose($myfile);
// dividing into lines
$lines=explode("\n",$contents);
echo count($lines)." lines\n";
// grouping into webpages
$pages=[];
$page=[];
$webId=-1;
for($i=0;$i< count($lines);$i++){
$comp=explode(",",$lines[$i]);
$webpag_id=trim($comp[14]);
if($webpag_id!="" && $webpag_id!="webpag_id" && $webpag_id!="web-id"){/// HIMAT FIX
if($webId!=$webpag_id){
$webId=$webpag_id;
if(count($page)>0){
array_push($pages,$page);
}
$page=[];
}
array_push($page,$lines[$i]);
}else{
//ignoring some header repeating or empty line
}
}
if(count($page)>0){
array_push($pages,$page);
}
echo count($pages)." pages\n";
// dividing into train / test
$testingPercent=0.2;
for($k=0;$k< 5;$k++){
$training=[];
$testing=[];
$lowThreshold=floor($testingPercent*$k*count($pages));
$highThreshold=floor($testingPercent*($k+1)*count($pages));
for($i=0;$i< count($pages);$i++){
if($i>=$lowThreshold && $i<$highThreshold){
array_push($testing,$pages[$i]);
}else{
array_push($training,$pages[$i]);
}
}
//echo count($training)." training\n";
//echo count($testing)." testing\n";
// generate the feature vector files
$trainingFileName="txt_files_datasets/$outputDirectory/training_".($k+1).".txt";
$myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatFileContents($training));
fclose($myfile);
$testingFileName="txt_files_datasets/$outputDirectory/testing_".($k+1).".txt";
$myfile = fopen($testingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatFileContents($testing));
fclose($myfile);
// generate the keyword mapping files
$trainingFileName="txt_files_datasets/$outputDirectory/training_kw_".($k+1).".txt";
$myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatKeywordFileContents($training));
fclose($myfile);
$testingFileName="txt_files_datasets/$outputDirectory/testing_kw_".($k+1).".txt";
$myfile = fopen($testingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatKeywordFileContents($testing));
fclose($myfile);
}
function formatFileContents($pages){
$string="";
for($i=0;$i< count($pages);$i++){
for($j=0;$j< count($pages[$i]);$j++){
$line=$pages[$i][$j];
// generating feature vector
$comp=explode(",",$line);
$vector=[];
for($k=1;$k< count($comp);$k++){
// no need for word itself starting at 1
array_push($vector,trim($comp[$k]));
}
for($k=0;$k< count($vector);$k++){
$string.=$vector[$k];
if($k< count($vector)-1){
$string.=" ";
}else{
$string.="\n";
}
}
}
}
return $string;
}
function formatKeywordFileContents($pages){
$string="";
for($i=0;$i< count($pages);$i++){
for($j=0;$j< count($pages[$i]);$j++){
$line=$pages[$i][$j];
// generating feature vector
$comp=explode(",",$line);
$vector=[];
array_push($vector,trim($comp[0]));
for($k=0;$k< count($vector);$k++){
$string.=$vector[$k];
if($k< count($vector)-1){
$string.=" ";
}else{
$string.="\n";
}
}
}
}
return $string;
}
PHP ends
(10) Test php file
PHP starts
// here we prepare training and testing data from raw information
// 3.7.2019 : RM - Implemented
// 28.10.2019: RM - updating to output keywords in separate files
/* input file format
Word,TF,H1,H2,H3,H4,H5,H6,Anchor,Title,URL-1,Url-2,Text Length,GT,webpag_id
0-Word: the candidate word extracted from the text of the webpage
1-TF: term frequency, i.e. how many times the word appears in the text
2:7-H1-H6: header tags (whether the word appears in each of them)
8-Anchor: text appearing inside anchor tags
9-Title: the title tag
10-Url-1: host part of the URL (main part of the URL)
11-Url-2: query/path part of the URL after the /
12-Text Length: total number of words inside the webpage
13-GT: ground-truth label (whether the word matches the ground-truth keywords)
14-Webpage_id: unique id for each webpage
*/
// file is big, need extra memory
ini_set('memory_limit', '15192M');
// read input//mopsi_services_414.csv,macworld_204.csv,guardian_412.csv
$outputDirectory="combined";
$inputFileName="csv_files/guardian_402.csv";
$myfile = fopen($inputFileName, "r") or die("Unable to open file!");
$contents = fread($myfile,filesize($inputFileName));
fclose($myfile);
// dividing into lines
$lines1=explode("\n",$contents);
echo count($lines1)." lines\n";
$inputFileName="csv_files/macworld_204.csv";
$myfile = fopen($inputFileName, "r") or die("Unable to open file!");
$contents = fread($myfile,filesize($inputFileName));
fclose($myfile);
// dividing into lines
$lines2=explode("\n",$contents);
echo count($lines2)." lines\n";
$inputFileName="csv_files/mopsi_services_312.csv";
$myfile = fopen($inputFileName, "r") or die("Unable to open file!");
$contents = fread($myfile,filesize($inputFileName));
fclose($myfile);
// dividing into lines
$lines3=explode("\n",$contents);
echo count($lines3)." lines\n";
$lines=[];
for($i=0;$i< count($lines1);$i++){
array_push($lines,$lines1[$i]);
}
for($i=0;$i< count($lines2);$i++){
array_push($lines,$lines2[$i]);
}
for($i=0;$i< count($lines3);$i++){
array_push($lines,$lines3[$i]);
}
// grouping into webpages
$pages=[];
$page=[];
$webId=-1;
for($i=0;$i< count($lines);$i++){
$comp=explode(",",$lines[$i]);
$webpag_id=trim($comp[14]);
if($webpag_id!="" && $webpag_id!="webpag_id" && $webpag_id!="web-id"){/// HIMAT FIX
if($webId!=$webpag_id){
$webId=$webpag_id;
if(count($page)>0){
array_push($pages,$page);
}
$page=[];
}
array_push($page,$lines[$i]);
}else{
//ignoring some header repeating or empty line
}
}
if(count($page)>0){
array_push($pages,$page);
}
echo count($pages)." pages\n";
// dividing into train / test
$testingPercent=0.2;
for($k=0;$k< 5;$k++){
$training=[];
$testing=[];
$lowThreshold=floor($testingPercent*$k*count($pages));
$highThreshold=floor($testingPercent*($k+1)*count($pages));
for($i=0;$i< count($pages);$i++){
if($i>=$lowThreshold && $i<$highThreshold){
array_push($testing,$pages[$i]);
}else{
array_push($training,$pages[$i]);
}
}
//echo count($training)." training\n";
//echo count($testing)." testing\n";
// generate the feature vector files
$trainingFileName="txt_files_datasets/$outputDirectory/training_".($k+1).".txt";
$myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatFileContents($training));
fclose($myfile);
$testingFileName="txt_files_datasets/$outputDirectory/testing_".($k+1).".txt";
$myfile = fopen($testingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatFileContents($testing));
fclose($myfile);
// generate the keyword mapping files
$trainingFileName="txt_files_datasets/$outputDirectory/training_kw_".($k+1).".txt";
$myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatKeywordFileContents($training));
fclose($myfile);
$testingFileName="txt_files_datasets/$outputDirectory/testing_kw_".($k+1).".txt";
$myfile = fopen($testingFileName, "w") or die("Unable to open file!");
fwrite($myfile, formatKeywordFileContents($testing));
fclose($myfile);
}
function formatFileContents($pages){
$string="";
for($i=0;$i< count($pages);$i++){
for($j=0;$j< count($pages[$i]);$j++){
$line=$pages[$i][$j];
// generating feature vector
$comp=explode(",",$line);
$vector=[];
for($k=1;$k< count($comp);$k++){
// no need for word itself starting at 1
array_push($vector,trim($comp[$k]));
}
for($k=0;$k< count($vector);$k++){
$string.=$vector[$k];
if($k< count($vector)-1){
$string.=" ";
}else{
$string.="\n";
}
}
}
}
return $string;
}
function formatKeywordFileContents($pages){
$string="";
for($i=0;$i< count($pages);$i++){
for($j=0;$j< count($pages[$i]);$j++){
$line=$pages[$i][$j];
// generating feature vector
$comp=explode(",",$line);
$vector=[];
array_push($vector,trim($comp[0]));
for($k=0;$k < count($vector);$k++){
$string.=$vector[$k];
if($k< count($vector)-1){
$string.=" ";
}else{
$string.="\n";
}
}
}
}
return $string;
}
PHP ends
#
# removes the non-existing files / files without tags
# renames the files
# 220 reduced to 204
# changes the numbers into binary files
import dranks as D
from nltk.corpus import stopwords
import re
stp ="january use jun jan feb mar apr may jul agust dec oct nov sep dec product continue one two three four five please thanks find helpful week job experience women girl apology read show eve knowledge benefit appointment street way staff salon discount gift cost thing world close party love letters rewards offers special close page week dollars voucher gifts vouchers welcome therefore march nights need name pleasure show sisters thank menu today always time needs welcome march february april may june jully aguast september october november december day year month minute second secodns".split(" ")
common_nouns='debt est dec big than who of com offer sale the in fi'.split(" ")
stps=set(stopwords.words("english"))
spchars = re.compile('\`|\~|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\_|\+|\=|\\|\||\{|\[|\]|\}|\:|\;|\'|\"|\<|\,|\>|\?|\/|\.|\- ')
from collections import defaultdict
from flask import Flask
import requests
from bs4 import BeautifulSoup
import dranks as D
import warnings
warnings.filterwarnings("ignore")
import csv
###########################################################################################################################################
import re
spchars = re.compile('\`|\~|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\_|\+|\=|\\|\||\{|\[|\]|\}|\:|\;|\'|\"|\<|\,|\>|\?|\/|\.|\- ’')
import dranks as D
stp ="january use jun jan feb mar apr may jul agust dec free oct nov sep dec product continue one two three four five please thanks find helpful week job experience women girl apology read show eve knowledge benefit appointment street way staff salon discount gift cost thing world close party love letters rewards offers special close page week dollars voucher gifts vouchers welcome therefore march nights need name pleasure show sisters thank menu today always time needs welcome march february april may june jully aguast september october november december day year month minute second secodns".split(" ")
common_nouns=['debt', 'est', 'dec', 'big', 'than', 'who', 'one', 'two', 'three', 'of', 'four', 'five', 'al sisi', 'free', 'gift', 'voucher', 'vouchers', 'try', 'best buy buying']
common_words =['of','for','the','www','fi','com','free','try','best']
name_of_file=['germon_stopwords_file','finnish_stopwords_file','english_stpwords_list']
germon_stopwords_file=D.Stop_Words_list(name_of_file[0])
finnish_stopwords_file=D.Stop_Words_list(name_of_file[1])
english_stpwords=D.Stop_Words_list(name_of_file[2])
stop_word_list =[english_stpwords, finnish_stopwords_file, germon_stopwords_file]
##################################################################################################
def Clean_text(text):
Words =[]
for word in text.split():
word = word.replace("’",' ')
word = word.lower()
word = spchars.sub(" ",word.strip())
if word not in stps:
if word not in stp:
if len(word)>1:
if word != " ":
if word not in common_nouns:
if not word.isdigit():
if word not in stop_word_list[0]:
for x in word.split():
if x not in stp and x not in stps and len(x)>1 and x not in common_nouns and x not in common_words:
x = x.strip()
if not x[0].isdigit():
Words.append(x)
return (Words)
def Remove_duplicates_GT(l):
new_list =[]
[new_list.append(x) for x in l if x not in new_list]
return (new_list)
def Clean_CSV_Files(words):
w=[]
import string
from string import digits
exclude = set(string.punctuation)
if type(words) is not list:
words = [(spchars.sub(" ", i)).replace('\n','').strip().lstrip(digits)for i in words.split(',') if i not in exclude ]
else:
words =[(spchars.sub(" ", i)).replace('\n','').strip().lstrip(digits)for i in words if i not in exclude ]
for x in words:
if len(x)>0:
w.append(x)
return(w)
def Tf_Score(fr,text_length):
if text_length < 50:
tf_score =((fr/100)*50)
else:
tf_score=((fr/100)*20)
return (tf_score)
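# Example (values made up): Tf_Score(4, 120) = (4/100)*20 = 0.8 for a long page,
# while the same frequency on a short page gives Tf_Score(4, 30) = (4/100)*50 = 2.0.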
def Feature_Score(candidate_word,feature_words,score):
total_score=0
score_single_time =0
for word_feature in feature_words:
if word_feature ==candidate_word:
#total_score+=score
score_single_time = score
return(score_single_time)
###############################################################################################
def Get_guardian_url_words_list(root):
Score =defaultdict()
csv.register_dialect('myDialect',delimiter = ',', quoting=csv.QUOTE_NONE, skipinitialspace=True)
stop_word_list =[english_stpwords, finnish_stopwords_file, germon_stopwords_file]
with open('guardian_402.csv', 'a',encoding ='utf-8') as f:
writer = csv.writer(f, dialect='myDialect')
writer.writerow(['Word','TF','H1','H2','H3','H4','H5','H6','Anchor','Title','Url-Host','Url-Query','Txt-Lngth','GT','web-id'])
for web_id in range(402):
try:
files= root + str(web_id) + '/tags.txt'
urls,GT=D.Read_Txt(files) # separates the ground truth and the url
url=str(urls)
text,HTML = D.Web(url)
H1, H2, H3, H4, H5, H6, anchor, title = D.Extract_headerAnchorTitle(HTML)
url_host, url_query = D.Urls(url)
words =Clean_text(text)
text_length =len(words)
words_and_freq = D.Calc_word_frequency(words)
score = []
in_gt =0
for word,fr in words_and_freq.items():
score_h1 = Feature_Score(word,H1,1) #6
score_h2 = Feature_Score(word,H2,1)#5
score_h3 = Feature_Score(word,H3,1)#3
score_h4 = Feature_Score(word,H4,1)#2
score_h5 = Feature_Score(word,H5,1)#2
score_h6 = Feature_Score(word,H6,1)#2
score_anchor = Feature_Score(word,anchor,1)#1
score_title = Feature_Score(word,title,1)#6
score_url_host = Feature_Score(word,url_host,1)#5
score_url_query = Feature_Score(word,url_query,1)#4
tf_score = Tf_Score(fr,text_length)
if word not in GT:
in_gt = 0
else:
in_gt = 1
writer.writerow([word,str(fr),str(score_h1),str(score_h2),str(score_h3),str(score_h4),str(score_h5),str(score_h6),str(score_anchor),str(score_title),str(score_url_host),str(score_url_query),str(text_length),str(in_gt),str(web_id)])
#score =word,str(fr),str(score_h1),str(score_h2),str(score_h3),str(score_h4),str(score_h5),str(score_h6),str(score_anchor),str(score_title),str(score_url_host),str(score_url_query),str(text_length),str(in_gt),str(web_id)]
#print ([word,str(fr),str(score_h1),str(score_h2),str(score_h3),str(score_h4),str(score_h5),str(score_h6),str(score_anchor),str(score_title),str(score_url_host),str(score_url_query),str(text_length),str(in_gt),str(web_id)])
except:
print (web_id)
pass
##############################################################################################################
root ='/home/tko/himat/web-docs/keywordextraction/dataset2/theguardian/'
from collections import defaultdict,Counter
Get_guardian_url_words_list(root)
#####################################################################################################
def Rename_files(path):
import os
files = os.listdir(path)
i = 0
for file in files:
os.rename(os.path.join(path, file), os.path.join(path, str(i)))
i = i+1
#path=r'/home/tko/himat/web-docs/keywordextraction/sets/indianexpress/'
#Rename_files(root)
###########################################################################
import os
def File_exists(root,ranges):
for x in range (ranges):
my_path = root + str(x) + '/tags.txt'
if os.path.exists(my_path) and os.path.getsize(my_path) > 0:
p =0
else:
print (x)
####################################################################################
#Creating the ground truth (GT) txt file
def GT_txt():
txt_file = open('gt.txt','w',encoding ='utf-8')
for web_id in range(402):
files= root + str(web_id) + '/tags.txt'
urls,GT=D.Read_Txt(files)
GT =' '.join(GT)
txt_file.write(str(web_id) + ' ' + str(GT) + '\n')
print (GT)
print ('---------------------',web_id)
txt_file.close()