Keyword Extraction for Webpages With Python Code

Machine Learning Group School of Computing University of Eastern Finland

Author Himat Shah (2018-2022)


About Keyword Extraction

What are keywords or keyphrases?

A keyword is a single word, or a sequence of words (a keyphrase), in the text that provides a concise, high-level description of the content to readers. The extraction of keywords is a fundamental step in text summarization, information retrieval, topic model construction, clustering, and advertising systems. The terms keyword and keyphrase are often used interchangeably, but researchers typically define a keyword as a single word and a keyphrase as a group of words.

Google defines a keyword as an isolated word or phrase that provides concise, high-level information about content to readers. With the increasing amount of data, users need more resources and time to understand content; keywords make it easier to understand the meaning of a text in fewer words.

Usages of keywords

Keywords summarize the key points presented in a text. When searching for information on search engines, keywords play a significant role in finding relevant content. Keywords are the most informative part of a text; they are its most prominent words and describe its content. Keywords are necessary in situations involving huge amounts of text data that need to be processed automatically, and they are widely used in document summarization, indexing, categorization, and clustering of huge datasets. Many scientific publications contain keyword lists that have been explicitly assigned by their authors. Other documents, however, have not been assigned keywords.

Keywords offer readers a concise, high-level summary of a document's content, thereby improving their understanding of that text. Keywords are the most relevant and important indicator for users seeking to grasp the fundamentals of a topic when scanning or skimming an article. Keyword extraction is a basic step in many text-mining and natural language processing (NLP) techniques, including text summarization, information retrieval, topic modeling, clustering, and content-based advertisement systems.

Issues

As webpages are constantly updated, it is difficult to create keywords manually. Manual keyword assignment is labor intensive, time consuming, and error prone.

Specialized curators use fixed taxonomies for manual keyword generation, but in some cases the keywords chosen by the author are not sufficiently comprehensive and accurate. Without high-quality keywords, users fail to find relevant information. Finding the relevant webpages a user is seeking is often a challenging task for which representative keywords or keyphrases are needed.

The majority of existing keyword extraction methods use language-dependent Natural Language Processing (NLP) techniques, including Part-of-Speech (POS) tagging, stemming, and lemmatization, which makes it difficult to generalize a method across languages.

So far, studies on language-independent approaches have been limited because they usually perform worse than methods that take advantage of linguistic features. Extracting keywords from web documents involves two main challenges: the first is the presence of noise and irrelevant data such as navigation bars, menus, comments, and advertisements; the second is the presence of multiple topics and multiple languages on the same page. Therefore, it is very important to have a general keyword extraction method that can extract keywords without relying on any specific language.

Fig. 1. Challenges of multiple languages on a webpage

We try to address the challenges of keyword extraction by developing and testing four new techniques, both language-dependent and language-independent as well as supervised and unsupervised. In our work, special attention is paid to finding the most relevant features for identifying good keyword and keyphrase candidates. The main purpose of the research is to extract only language-independent features of webpages in order to find a method that can be applied to different languages.

The work deals with statistical, linguistic, and structural features as well as their combinations. This diversity of approaches serves the pragmatic overall goal of finding the best available methods by assessing the relative performances of the newly developed as well as existing methods on a number of different datasets.

What are the key findings or observations of this research?

In this research work, we have developed three language-independent methods and one language-dependent method to extract keywords from webpages. Most existing methods rely on Natural Language Processing (NLP) techniques, including Part-of-Speech (POS) tagging, stemming, and lemmatization, which are language-dependent and make it difficult to generalize the method to other languages. This research aims to find a method that can be applied to webpages regardless of their language, by extracting only language-independent features.

It is challenging to extract keywords from web documents for two reasons: the first is the presence of navigation bars, menus, comments, and advertisements; the second is the presence of multiple topics and multiple languages on the same page. It is therefore important to have a general keyword extraction method that does not rely on a particular language.

For this purpose, the author proposes four new automatic keyword extraction methods for webpages: Hrank, D-rank, WebRank, and ACI-rank.

HRANK

In this section we discuss Hrank in four parts:
(1) Introduction
(2) Methodology
(3) Python Implementation
(4) Output Section

(1) Introduction

We study the importance of the distribution of semantically similar POS tags, such as nouns, adjectives, and verbs, in the extraction of relevant keywords from a webpage.

  • A new keyword extraction method that requires minimal knowledge of the DOM structure.
  • A simple measure, term frequency (TF), performs better than more complex methods.
  • However, the combination of nouns, adjectives, and verbs improves performance when TF fails.
  • The proposed method outperforms CL-Rank, TextRank, and TF.

(2) Methodology

Fig. 2 presents the workflow of the proposed keyword extraction method. The method has two modules: (1) pre-processing and (2) keyword extraction. The pre-processing module extracts the natural language text from the webpage, and the keyword extraction module operates on the text produced by the pre-processing module.

In the pre-processing module, the first three functions filter the text from all the other content of a webpage. All the content of a webpage is extracted using the Document Object Model (DOM) and XPath functions. Text belonging to JavaScript and Cascading Style Sheets (CSS) is eliminated in the text filtering function. Special characters, such as @, *, £, or $, punctuation marks, and numbers are also filtered out using regular expressions. Similarly, the text filtering function removes stop words, i.e., natural language words that carry minimal or no meaning, such as and, the, a, and an. The filtered text can then be used for natural language processing.

The POS extractor, normalize text, and separate POS functions perform the natural language processing on the filtered text. The POS extractor function divides the text into tokens, where a token is a whitespace-separated unit of text. The tokens are then assigned POS tags, such as nouns, adjectives, and verbs.

Fig. 2. Workflow of the Hrank method

The tokens with POS tags are further normalized. Normalization is the process of replacing the inflected forms of a word with the root word; an inflected form represents a different usage of a word in a sentence and has a changed spelling or ending. For example, finds, finding, and found are inflected forms of the word find. In natural language processing, lemmatization maps inflected forms with different spellings, such as finds and found, back to the root word find. Stemming, in contrast, removes prefixes and suffixes to find the root word, as for finding in the example above. The output of the normalization process is the list of tokens with all inflected forms replaced by their root words.

The lists of POS-tagged tokens are provided to the separate POS function, which splits the tokens into lists of nouns, adjectives, and verbs. These lists are passed to the count frequency function, which calculates the frequency of the words in the separate noun, adjective, and verb lists. The top-frequent tokens are selected as candidate keywords. Semantically similar words among the top-frequent tokens are grouped together using a lexical database named WordNet. The lexical database helps in finding the synsets of the words; a synset is a set of one or more synonyms that can be used interchangeably in some context [20].
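As a small illustration (not part of the Hrank pipeline itself), NLTK's lemmatizer and stemmer can be compared directly on the example above; the exact outputs may vary slightly with the NLTK and corpus versions:

# Minimal sketch comparing lemmatization and stemming with NLTK.
# Assumes the WordNet corpus has been downloaded via nltk.download('wordnet').
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for form in ["finds", "finding", "found"]:
    # pos='v' tells the lemmatizer to treat the token as a verb
    print(form, lemmatizer.lemmatize(form, pos='v'), stemmer.stem(form))
# Expected (approximately): finds -> find / find, finding -> find / find, found -> find / found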

We compute the semantic similarity of two different words using path similarity, which is based on WordNet [21]. Words that have no synonyms in WordNet are removed from the lists. The path similarity metric calculates a score between two different words in terms of their relatedness. We use path similarity because it is simple and operates on the parent-child relationships of the WordNet tree, which makes it convenient for our case. Three similarity matrices are created independently for the nouns, adjectives, and verbs, and they are used to cluster the related words. We use agglomerative clustering to find similar words in the lists. The clusters are scored by summing the frequencies of all the words in each cluster and are ranked according to these scores.
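To make the similarity matrix concrete, here is a toy sketch with a hypothetical four-word candidate list. It assumes the NLTK WordNet corpus is installed and uses the first synset of each word, as the Hrank implementation below does; newer scikit-learn versions take metric='precomputed' where older ones used affinity='precomputed', and the exact grouping may vary with the WordNet version.

# Toy sketch of the WordNet-based distance matrix and agglomerative clustering
# (hypothetical word list, not the output of the real pipeline).
from nltk.corpus import wordnet as wn
from sklearn.cluster import AgglomerativeClustering

words = ["king", "queen", "football", "goal"]
synsets = [wn.synsets(w)[0] for w in words]   # first synset per word

distance = []
for a in synsets:
    row = []
    for b in synsets:
        sim = a.path_similarity(b)
        row.append(1.0 - sim if sim is not None else 1.0)   # unrelated words get maximum distance
    distance.append(row)

# Precomputed distances with complete linkage, as in the Hrank code below.
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="complete").fit_predict(distance)
print(dict(zip(words, labels)))   # e.g. {'king': 0, 'queen': 0, 'football': 1, 'goal': 1}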

    (3) Python Implementation

    1. Extract Text
    2. Preprocess Text
    3. POS Tags Separation
    4. WordNet semantic similarity
    5. Cluster Words
    6. Keyword Ranking and Selection

    
                  
                  

    Import packages

import urllib.request
import sys
import re
import math
import string
import textwrap

import lxml
import requests
import nltk
import pandas as pd
import numpy as np
import sklearn.cluster                      # used by the clustering step below

from collections import defaultdict, Counter
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from bs4.element import Comment

from nltk import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn       # used by the WordNet similarity step below
from nltk.stem import WordNetLemmatizer

# Domain-specific word lists used during preprocessing and URL parsing
Common_Nouns = "january debt est dec big than who use jun jan feb mar apr may jul agust dec oct nov sep dec product continue second secodns".split(" ")
URL_CommnWords = ['', 'https', 'www', 'com', '-', 'php', 'pk', 'fi', 'http:', 'http']
URL_CommonQueryWords = ['', 'https', 'www', 'com', '-', 'php', 'pk', 'fi', 'https:', 'http', 'http:', 'html', 'htm']
UselessTagsText = ['html', 'style', 'script', 'head', '[document]', 'img']

    (1) Extract Text of Webpage

def Scrapper1(element):
    # Keep only text nodes that are visible to the reader
    if element.parent.name in UselessTagsText:
        return False
    if isinstance(element, Comment):
        return False
    return True

def Scrapper2(body):
    # Extract the visible text of an HTML document
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(Scrapper1, texts)
    return u" ".join(t.strip() for t in visible_texts)

def Scrapper3(text):
    # Break the text into lines, strip whitespace, and drop blank chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return u'\n'.join(chunk for chunk in chunks if chunk)

def Scrapper_title_4(URL):
    # Return the <title> text of a webpage together with its URL
    req = urllib.request.Request(URL, headers={'User-Agent': "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    title = []
    soup = BeautifulSoup(html, 'lxml')
    title.append(soup.title.string)
    return (title, URL)

def Web_Funtion(URL):
    # Download a webpage and return its raw visible text and the parsed HTML
    req = urllib.request.Request(URL, headers={'User-Agent': "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    Raw_HTML_Soup = BeautifulSoup(html, 'lxml')
    raw = Scrapper2(html)
    Raw_text = Scrapper3(raw)
    return (Raw_text, Raw_HTML_Soup)

    (2) Preprocessing Text

def Preprocessing_Text(Raw_text):
    # 1. Turn the raw text into a list of whitespace-separated words
    stopwords_nltk = set(stopwords.words("english"))
    Words_in_text = []
    for word in Raw_text.split():
        Words_in_text.append(word)

    # 2. Remove numbers and special characters
    alphawords_only = [word for word in Words_in_text if word.isalpha()]

    # 3. Remove one-character words
    Words_afterRemoval_onelength = [word for word in alphawords_only if len(word) > 1]

    # 4. Lower-case all words
    lower_case_only = [word.lower() for word in Words_afterRemoval_onelength]

    # 5. Remove stopwords
    words_withoutStopwords = [word for word in lower_case_only if word not in stopwords_nltk]

    # 6. Remove common nouns such as thank, use, gift, close
    words_withoutCommonNouns = [word for word in words_withoutStopwords if word not in Common_Nouns]

    # Return the list of preprocessed words
    return words_withoutCommonNouns

def Calc_words_frequency(Text_words):
    # Return a dictionary of words sorted by decreasing frequency
    Sorted_WordCount_dict = {}
    word_and_fr_list = []
    Count_fr = Counter(Text_words)
    for word, word_count in Count_fr.most_common():
        word_and_fr_list.append([word, word_count])
        Sorted_WordCount_dict[word] = word_count
    return Sorted_WordCount_dict

    Parts of Speech Tagger (POS)

def POS_seperator(Text):
    # Split the tokens into separate noun, adjective, and verb lists
    adj = []
    verb = []
    nouns = []
    for line in Text:
        tokens = nltk.word_tokenize(line)
        tagged = nltk.pos_tag(tokens)
        for x, y in tagged:
            if y in ['NNP', 'NNPS', 'NNS', 'NN']:
                nouns.append(x)
            if y in ['JJ', 'JJR', 'JJS']:
                adj.append(x)
            if y in ['VB', 'VBD', 'VBG', 'VBN', 'VBP']:
                verb.append(x)
    return (nouns, adj, verb)

def Count_frequencies_for_POS(N, POS_text):
    # Return the N most frequent words of a POS list with their counts
    Word_only = []
    Word_frequency_only = []
    words_and_freq = Counter(POS_text)
    for word, counts in words_and_freq.most_common(N):
        Word_only.append(word)
        Word_frequency_only.append(counts)
    return (Word_only, words_and_freq, Word_frequency_only)
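As a quick, illustrative check of POS_seperator (the exact tags depend on the NLTK tagger version, so the grouping below is only indicative):

nouns, adjectives, verbs = POS_seperator(["king charles crowned in historic coronation"])
print(nouns)       # e.g. ['king', 'charles', 'coronation']
print(adjectives)  # e.g. ['historic']
print(verbs)       # e.g. ['crowned']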

    WordNet semantic similarity

def Get_Synsets_Score(most_frequent_40_nouns):
    # Split the candidate nouns into those with and without WordNet synsets
    words_list_with_synsets = []
    word_list_without_synsets = []
    for word in most_frequent_40_nouns:
        a1 = wn.synsets(word)
        if len(a1) > 0:
            words_list_with_synsets.append(word)
        else:
            word_list_without_synsets.append(word)
    return (words_list_with_synsets, word_list_without_synsets)
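For example (illustrative only), a common noun such as coronation has WordNet synsets, while a brand-like token such as iplayer typically does not and would therefore be dropped from the candidate list:

with_synsets, without_synsets = Get_Synsets_Score(["coronation", "iplayer"])
print(with_synsets)     # expected: ['coronation']
print(without_synsets)  # expected: ['iplayer']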

    Clustering

def Get_Clusters(fr, t6, clusters_to_write):
    # Cluster the candidate words by WordNet similarity and pick keywords.
    # fr is a Counter of word frequencies; t6 is the list of candidate words with synsets.
    # f = open(clusters_to_write, 'w', encoding="utf8")   # optional cluster dump
    simstr = ""
    wordlist = []
    dm = []                                   # pairwise distance matrix
    for i in t6:
        a2 = wn.synsets(i)[0]
        dm.append([])
        wordlist.append(i)
        for x in t6:
            b2 = wn.synsets(x)[0]
            wup1 = a2.wup_similarity(b2)
            if wup1 is None:
                simstr += "0.0 "
                dm[-1].append(1.0)            # unrelated words get maximum distance
                continue
            dm[-1].append(1.0 - wup1)
            simstr += str(wup1) + " "
        simstr += "\n"

    # Cluster on the precomputed distance matrix (newer scikit-learn versions
    # use metric='precomputed' instead of affinity='precomputed')
    num_clusters = 8
    agg = sklearn.cluster.AgglomerativeClustering(n_clusters=num_clusters,
                                                  affinity='precomputed',
                                                  linkage="complete")
    cluster_labels = agg.fit_predict(dm)

    # Score each cluster by the total frequency of its words
    keywords = []
    clusters = {}
    for i in range(num_clusters):
        clusters[i] = {'clusterSize': 0, 'items': []}
    clusterSizes = [0] * num_clusters
    for i in range(len(cluster_labels)):
        clusters[cluster_labels[i]]['clusterSize'] += fr[t6[i]]
        clusters[cluster_labels[i]]['items'].append([t6[i], fr[t6[i]]])
        clusterSizes[cluster_labels[i]] += fr[t6[i]]
    maxClusterSize = max(clusterSizes)
    maxFrequency = fr[max(fr, key=fr.get)]

    # Keep only sufficiently large clusters and frequent words inside them
    for i in range(num_clusters):
        if clusters[i]['clusterSize'] < maxClusterSize * 0.3:
            continue
        keywords.append(clusters[i]['items'][0][0])
        for word in clusters[i]['items'][1:-1]:
            if word[1] > 3 and word[1] > 0.2 * maxFrequency:
                keywords.append(word[0])
    return keywords

    Calling Hrank function

def Hrank(URL):
    (Raw_text, Raw_HTML_Soup) = Web_Funtion(URL)

    preprocess_TextWords = Preprocessing_Text(Raw_text)
    text_length = len(preprocess_TextWords)
    words_count_dic = Calc_words_frequency(preprocess_TextWords)

    (nouns, adjectives, verbs) = POS_seperator(preprocess_TextWords)
    length_nouns = len(nouns)

    # Top 40 most frequent nouns, 2 adjectives, and 1 verb
    (most_frequent_40_nouns, frequencies_nouns, counts_nouns) = Count_frequencies_for_POS(40, nouns)
    (Adjectives_two, frequencies_adjectives, count_adjective) = Count_frequencies_for_POS(2, adjectives)
    (Verb_one, frequencies_verb, count_verb) = Count_frequencies_for_POS(1, verbs)

    # Keep only the nouns that have WordNet synsets
    (words_list_with_synsets, word_list_without_synsets) = Get_Synsets_Score(most_frequent_40_nouns)

    # Cluster the nouns and combine the cluster keywords with the adjectives and verb
    keywords = Get_Clusters(frequencies_nouns, words_list_with_synsets, "cluster.txt")
    keywords_combine = list(keywords + Adjectives_two + Verb_one)
    return keywords_combine

if __name__ == "__main__":
    URL = "http://bbc.com"
    Keywords = Hrank(URL)
    print(Keywords)
    Hrank Python Implementation ends

    Output Section Hrank

    Hrank Keywords
    
                    home, uk, world, pictures, watch, reel, top, guide,
                     premier, king, royal, coronation, sport, living, event,
                     travel, victory, technology, arrest, business, culture, news, return
                    

    Drank

The Drank section comprises four parts:
(1) Introduction (2) Methodology (3) Python Implementation (4) Output Section

    (1) Introduction

This work deals with webpage keyword extraction, which is crucial for the information retrieval task performed by search engines browsing the internet. As such, keyword extraction is a specific kind of information extraction task, where the use of a natural language, or even several languages, poses severe challenges. To conquer these challenges, appropriate natural language processing (NLP) techniques have to be applied. As the method deals with webpages, the task is further complicated by the varying structure and layout of the pages. Even though Google search is widely and successfully used by a vast number of people for many purposes, the search results are often far from optimal, and processing natural language documents remains challenging.

Fig. 3. Advertisements on a webpage

The D-rank method is an unsupervised method in which candidate keywords are ranked based on their position in the content, after extracting their features from the DOM structure. The author tested the proposed method on a dataset of webpages in three languages: English, Finnish, and German.

    (2) Methodology

Fig. 4. Workflow of the Drank method

    (3) Python Implementation

    1. Extract Text
    2. Detect Language
    3. Preprocess Text
    4. Candidate Keywords
    5. Feature Formation
    6. Feature Scoring
    7. Keyword Ranking and Selection

     Imports
      import urllib
      import nltk
      import sys
      import re 
      
      import lxml
      import math
      import string
      import textwrap
      import requests
      
      from nltk.corpus import stopwords
      from bs4 import BeautifulSoup
      from nltk import word_tokenize
      from nltk.stem import WordNetLemmatizer
      from collections import defaultdict,Counter
      from bs4.element import Comment
      
      from nltk import wordpunct_tokenize
      from urllib.parse import urlparse 
      
      import pandas as pd 
      import numpy as np
      import warnings
      warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
      
      Common_Nouns ="january debt est dec big than who use jun jan feb mar apr may jul agust dec oct ".split(" ")
      URL_CommnWords =['','https','www','com','-','php','pk','fi','http:','http']
      URL_CommonQueryWords = ['','https','www','com','-','php','pk','fi','https:','http','http:']
      UselessTagsText =['html','style', 'script', 'head',  '[document]','img']
      def Scrapper1(element):
          if element.parent.name in UselessTagsText:
              return False
          if isinstance(element, Comment):
              return False
          return True
      
      def Scrapper2(body):             
          soup = BeautifulSoup(body, 'lxml')      
          texts = soup.findAll(text=True)   
          name =soup.findAll(name=True) 
          visible_texts = filter(Scrapper1,texts)        
          return u" ".join(t.strip() for t in visible_texts)
      
      def Scrapper3(text):                  
          lines = (line.strip() for line in text.splitlines())    
          chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
          return u'\n'.join(chunk for chunk in chunks if chunk)
      
      
      def Scrapper_title_4(URL):
        req = urllib.request.Request(URL, headers={'User-Agent' : "Magic Browser"})
        con = urllib.request.urlopen(req)
        html= con.read()
        title=[]
        
        soup = BeautifulSoup(html, 'lxml') 
        title.append(soup.title.string)
        return (title, URL)
      
      def Web_Funtion(URL):
        req = urllib.request.Request(URL, headers={'User-Agent' : "Magic Browser"})
        con = urllib.request.urlopen(req)
        html= con.read()  
        Raw_HTML_Soup = BeautifulSoup(html, 'lxml') 
       
        raw =Scrapper2(html)
        Raw_text = Scrapper3(raw) 
        return(Raw_text,Raw_HTML_Soup) 
      
      
      def _calculate_languages_ratios(text):  
          languages_ratios = {}
          tokens = wordpunct_tokenize(text)
          words = [word.lower() for word in tokens]    
          for language in stopwords.fileids():
              stopwords_set = set(stopwords.words(language))
             
              words_set = set(words)
              common_elements = words_set.intersection(stopwords_set)
      
              languages_ratios[language] = len(common_elements) 
          return languages_ratios
      
      
      
      def detect_language(text):
          ratios = _calculate_languages_ratios(text)
          most_rated_language = max(ratios, key=ratios.get)
          stop_words_for_language = set(stopwords.words(most_rated_language))
          return most_rated_language,stop_words_for_language
      
      def Preprocessing_Text(Raw_text, stop_words_for_language):
          
          # 1 making text as a space seperated word list
          stop_words_for_language = str(stop_words_for_language).lower()
          Words_in_text =[]
          for word in Raw_text.split():                    
              Words_in_text.append(word)
      
          
           #2 remove numbers and special charactes from words
              
          alphawords_only = [word for word in Words_in_text if word.isalpha()]          
          
          #3 removing length 1 words
          
          Words_afterRemoval_onelength = [word for word in alphawords_only if len(word)>1]
      
          #4 lower case all words
          
          lower_case_only = [word.lower() for word in Words_afterRemoval_onelength ]
          
          # Remove stopwords 
          
          stopwords_nltk = set(stopwords.words("english"))
          words_withoutStopwords = [word for word in lower_case_only if word not in stopwords_nltk]
          if stop_words_for_language != "english":
              words_withoutStopwords = [word for word in words_withoutStopwords if word not in stop_words_for_language]
          
          #removing words from common nouns like thank, use, gift, close
          
          words_withoutCommonNouns = [word for word in words_withoutStopwords if word not in Common_Nouns ]
          
          #return list of preprocess words
          
          return (words_withoutCommonNouns)
      
      
def Calc_words_frequency(Text_words):
    # Return a dictionary of words sorted by decreasing frequency
    Sorted_WordCount_dict = {}
    word_and_fr_list = []
    Count_fr = Counter(Text_words)
    for word, word_count in Count_fr.most_common():
        word_and_fr_list.append([word, word_count])
        Sorted_WordCount_dict[word] = word_count
    return Sorted_WordCount_dict
    #FEATURES
def Function_ParseURL(URL):
    # Split the URL into its host words and its path/query words
    URL = str(URL)
    host = []
    obj = urlparse(URL)
    name = obj.hostname
    if len(name) > 0:
        for x in name.split('.'):
            if x.lower() not in URL_CommonQueryWords:
                host.append(x)
    else:
        host.append(name)
    path = []
    host_part_URL = []
    for url_parts in URL.split('/'):
        for url_part in url_parts.split('.'):
            if len(url_part) > 0:
                for url_words in url_part.split('-'):
                    if url_words.lower() not in URL_CommnWords and url_words.lower() not in host:
                        path.append(url_words.lower())
            else:
                path.append(url_parts)
    return (host, path)
def function_TexDic_Filter(Tag_TextDic):
    # Flatten the tag-text dictionary into a plain list of words
    alt_words = []
    if len(Tag_TextDic) > 0:
        for k, i in Tag_TextDic.items():
            for x in i:
                word = [n for n in x.split(',')]
                for x in word:
                    words = [i for i in x.split()]
                    for x in words:
                        alt_words.append(x)
        return alt_words
    else:
        return alt_words

def function_Tag_Text(Raw_HTML_Soup, Tag_name):
    # Collect the lower-cased text of all elements with the given tag name
    TagTextList = []
    for text in Raw_HTML_Soup.find_all(Tag_name):
        tag_text = text.text.strip().lower()
        TagTextList.append(tag_text)
    return TagTextList

def function_HeaderTitleAnchorText(Raw_HTML_Soup):
    # Extract the header, title, and anchor texts of a webpage
    H1_TextList = function_Tag_Text(Raw_HTML_Soup, 'h1')
    H2_TextList = function_Tag_Text(Raw_HTML_Soup, 'h2')
    H3_TextList = function_Tag_Text(Raw_HTML_Soup, 'h3')
    H4_TextList = function_Tag_Text(Raw_HTML_Soup, 'h4')
    H5_TextList = function_Tag_Text(Raw_HTML_Soup, 'h5')
    H6_TextList = function_Tag_Text(Raw_HTML_Soup, 'h6')
    Title_TextList = function_Tag_Text(Raw_HTML_Soup, 'title')
    Anchor_TextList = function_Tag_Text(Raw_HTML_Soup, 'a')
    return (H1_TextList, H2_TextList, H3_TextList, H4_TextList,
            H5_TextList, H6_TextList, Title_TextList, Anchor_TextList)

def function_MakeDictTagText(Raw_HTML_Soup):
    # Build per-tag word lists for the feature scoring step
    (H1_TextList, H2_TextList, H3_TextList, H4_TextList, H5_TextList,
     H6_TextList, Title_TextList, Anchor_TextList) = function_HeaderTitleAnchorText(Raw_HTML_Soup)
    H1_TextDict = {"h1": H1_TextList}
    H2_TextDict = {"h2": H2_TextList}
    H3_TextDict = {"h3": H3_TextList}
    H4_TextDict = {"h4": H4_TextList}
    H5_TextDict = {"h5": H5_TextList}
    H6_TextDict = {"h6": H6_TextList}
    Title_TextDict = {"title": Title_TextList}
    Anchor_TextDict = {"a": Anchor_TextList}
    H1_dic = function_TexDic_Filter(H1_TextDict)
    H2_dic = function_TexDic_Filter(H2_TextDict)
    H3_dic = function_TexDic_Filter(H3_TextDict)
    H4_dic = function_TexDic_Filter(H4_TextDict)
    H5_dic = function_TexDic_Filter(H5_TextDict)
    H6_dic = function_TexDic_Filter(H6_TextDict)
    Title_dic = function_TexDic_Filter(Title_TextDict)
    Anchor_dic = function_TexDic_Filter(Anchor_TextDict)
    return (H1_dic, H2_dic, H3_dic, H4_dic, H5_dic, H6_dic, Title_dic, Anchor_dic)
def Feature_Score(candidate_word, feature_words, score):
    # Return the feature score once if the candidate appears among the feature words
    total_score = 0
    score_single_time = 0
    for word_feature in feature_words:
        if word_feature == candidate_word:
            # total_score += score
            score_single_time = score
    return score_single_time

def Tf_Score(fr, text_length):
    # Term-frequency score: short pages weight the frequency more heavily
    if text_length < 50:
        tf_score = (fr / 100) * 50
    else:
        tf_score = (fr / 100) * 20
    return tf_score
def function_word_Fr_TagName_ScoreDic(words_count_dic, text_length, Raw_HTML_Soup):
    # Combine the tag-based feature weights with the TF score for every word
    wrd_fr_Tgs_Fnl_score = defaultdict()
    Word_Final_Score = defaultdict()
    # URL is the module-level variable set in __main__ below
    Host_part_of_URL, Query_part_of_URL = Function_ParseURL(URL)

    # Names and manually assigned weights of the ten features
    Name_FeaturesList = np.array(['H1', 'H2', 'H3', 'H4', 'H5', 'H6',
                                  'Title', 'Anchor', 'URL-H', 'URL-Q'])
    Manual_Score_Each_Features = np.array([6, 5, 4, 3, 2, 2, 6, 1, 5, 4])

    # All the words found in each feature
    (H1_dic, H2_dic, H3_dic, H4_dic, H5_dic, H6_dic,
     Title_dic, Anchor_dic) = function_MakeDictTagText(Raw_HTML_Soup)
    featuresText_allDict_npArrayList = np.array([H1_dic, H2_dic, H3_dic, H4_dic, H5_dic, H6_dic,
                                                 Title_dic, Anchor_dic,
                                                 Host_part_of_URL, Query_part_of_URL], dtype=object)

    for word, fr in words_count_dic.items():
        tf_score = Tf_Score(fr, text_length)
        tag = []
        name_tag = []
        for word_inAll_Dic in range(len(featuresText_allDict_npArrayList)):
            if word in featuresText_allDict_npArrayList[word_inAll_Dic]:
                tag.append(Manual_Score_Each_Features[word_inAll_Dic])
                name_tag.append(Name_FeaturesList[word_inAll_Dic])
        score = sum(tag) + tf_score
        Word_Final_Score[word] = score
        wrd_fr_Tgs_Fnl_score[word] = fr, name_tag, score
    return (wrd_fr_Tgs_Fnl_score, Word_Final_Score)
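To see how these weights produce the scores reported in the Feature Score Output table further below: the word bbc occurs 14 times in a text of 922 preprocessed words, so its TF score is (14/100)*20 = 2.8; it also appears in H1 (6), H2 (5), Title (6), Anchor (1), and URL-H (5), giving 6+5+6+1+5 = 23 and a final score of 23 + 2.8 = 25.8, which matches the first row of the table.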
def function_Drank_KeywordExtraction(URL):
    Raw_text, Raw_HTML_Soup = Web_Funtion(URL)
    most_rated_language, stop_words_for_language = detect_language(Raw_text)
    preprocess_TextWords = Preprocessing_Text(Raw_text, stop_words_for_language)
    text_length = len(preprocess_TextWords)
    words_count_dic = Calc_words_frequency(preprocess_TextWords)

    # Features: header, title, and anchor texts
    (H1_TextList, H2_TextList, H3_TextList, H4_TextList, H5_TextList,
     H6_TextList, Title_TextList, Anchor_TextList) = function_HeaderTitleAnchorText(Raw_HTML_Soup)

    # Feature score dictionary combining tag weights and TF score
    (wrd_fr_Tgs_Fnl_score, Word_Final_Score) = function_word_Fr_TagName_ScoreDic(
        words_count_dic, text_length, Raw_HTML_Soup)

    Keywords = []
    sorted_word_score = Counter(Word_Final_Score)
    for word, score in sorted_word_score.most_common(10):
        Keywords.append(word)
    return Keywords

if __name__ == "__main__":
    URL = "http://bbc.com"
    Keywords = function_Drank_KeywordExtraction(URL)
    print(Keywords)

(4) Output Section


    (1) Extract Text,Raw HTML Webpage Output


    Raw Text

    
                          html
                          BBC - Homepage
                          window.orb_fig_blocking = true;
                          window.bbcredirection = {geo: true};
                          :root {
                          --bbc-font: ReithSans, Arial, Helvetica, freesans, sans-serif;
                          --bbc-font-legacy: Arial, Helvetica, freesans, sans-serif;
                          }
                          window.orbitData = {};
                          var additionalPageProperties = {};
                          additionalPageProperties['custom_var_1'] = 'international' || null;
                          additionalPageProperties['custom_var_9'] = '1' || null;
                          additionalPageProperties['experience_global_platform'] = 'orbit';
                          window.orbitData.userProfileUrl = "https://www.bbc.co.uk/userprofile";
                          window.page = {
                          name: 'home.page' || null,
                          destination: 'HOMEPAGE_GNL' || null,

    Text of Webpage


    Homepage Accessibility links Skip to content Accessibility Help BBC Account require(['idcta/statusbar'], function (statusbar) {new statusbar.Statusbar({id: 'idcta-statusbar', publiclyCacheable: true});}); Notifications Home News Sport Weather iPlayer Sounds Bitesize CBeebies CBBC Food Home News Sport Reel Worklife Travel Future Culture TV Weather Sounds More menu
                            Search BBC
                            Search BBC
                            Home News Sport Weather iPlayer Sounds Bitesize CBeebies CBBC Food Home News Sport Reel Worklife Travel Future Culture TV Weather Sounds Close menu
                            BBC Homepage
                           
                            Charles and Camilla crowned in historic Coronation
                            King Charles and Queen Camilla have been crowned on a day of pageantry, history - and downpours.
                            UK
                            Charles and Camilla crowned in historic Coronation
                            The story of Coronation day in extraordinary photos
                            News
                            The story of Coronation day in extraordinary photos
                            Prince Harry leaves alone after Coronation
                            UK
                            Prince Harry leaves alone after Coronation
                            Dozens of protesters arrested during Coronation
                            UK
                            Dozens of protesters arrested during Coronation
                            Real Madrid win first Copa del Rey since 2014
                            European Football
                            Real Madrid win first Copa del Rey since 2014

    (2) Language Detection


    English Language

    (3) Preprocessing Text Output


    BBC -Homepage Homepage Accessibility links Skip to content
                    Accessibility Help BBC Account Notifications Home News Sport Weather
                    iPlayer Sounds Bitesize CBeebies CBBC Food Home News Sport Reel
                    Worklife Travel Future Culture TV Weather Sounds More menu Search
                    BBC Search BBC Home News Sport Weather iPlayer Sounds Bitesize
                    CBeebies CBBC Food Home News Sport Reel Worklife Travel Future
                    Culture TV Weather Sounds Close menu BBC Homepage Putin should be

    
                          Total words in a webpage : 9496 
                          Length of Words length after removing 1 length words: 1537        
                          Words length After numbers, structure removal:1349
                          Words after removing special characters:1319
                          After removing stopwords: 945
                          After removing common nouns length of Final text:922
                        

    (4) Candidate Keywords


    
                        BBC -Homepage Homepage Accessibility links Skip to content
                        Accessibility Help BBC Account Notifications Home News Sport Weather
                        iPlayer Sounds Bitesize CBeebies CBBC Food Home News Sport Reel
                        Worklife Travel Future Culture TV Weather Sounds More menu Search
                        BBC Search BBC Home News Sport Weather iPlayer Sounds Bitesize
                        CBeebies CBBC Food Home News Sport Reel Worklife Travel Future
                        Culture TV Weather Sounds Close menu BBC Homepage Putin should be
                        sentenced for 'criminal actions' - Zelensky The Ukrainian president
                        calls for the creation of a new war tribunal as he addresses The
                        Hague. Europe Putin should be sentenced for 'criminal actions' -
                        Zelensky 
                      

    (5) Feature Formation


                        
                        H1              
    
                        ['bbc homepage']   
    
                        H2          
    
                        ['accessibility links', '', 'news', 'sport', 'coronation of king
                        charles iii', 'london weather', 'editor’s picks', 'latest business
                        news', 'technology of business', 'advertisement', 'new tech
                        economy', 'featured video', 'bbc world service', 'more around the
                        bbc', 'from our correspondents', 'global trade', 'new tech economy',
                        'world in pictures', 'bbc in other languages', 'more languages',
                        'explore the bbc'] 
                        
                        H3
    ['us denies masterminding moscow drone attack', 'top us judge under fresh scrutiny over school fees', 'ed sheeran wins thinking out loud copyright case', 'what side-hustlers are really making', 'the true story of the kentucky derby', 'four proud boys guilty of seditious conspiracy', 'silence and teddies at scene of serbia school shooting', 'prince william and kate drop into a soho pub', 'serie a: napoli bidding to clinch title but go 1-0 down at udinese', 'a tour of a lost world, before football changed forever', 'premier league: brighton 0-0 man utd - rashford goes close to opener', 'what kings wore from tudor times to now', "the 'super-deep' diamonds in the crown", 'your full guide to how coronation day will unfold', 'thu', 'fri', 'sat', 'sun', 'a misunderstood ancient erotic manual', 'why do french men pee on the street?', 'the surprisingly deadly secret of the grapefruit', 'a regal scone made for king charles iii', 'do you own too many clothes?', 'can remote-work gossip backfire?', 'the rappers risking the death penalty', 'why the wicker man has divided opinion for 50 years', 'camilla: from tabloid target to crowned queen', "lizzo thanks 'king of flutes' for met gala duet", 'shell reports stronger than expected profits', 'us raises interest rates to highest in 16 years', 'investors sue over credit suisse collapse', 'china tourism rebounds above pre-pandemic levels', "branson feared 'losing everything' in pandemic", 'the revival of a historic italian fruit', 'the first climate-resilient nation?', 'a major positive climate tipping point', 'why there is serious money in kitchen fumes', 'the people turning time into a currency', "ukraine's first lady and pm's wife embrace outside no 10", "ukraine's first lady and pm's wife embrace...", 'how well does william pull a pint?', 'space trash floats away during spacewalk', "ros atkins on... the videos showing 'kremlin...", "russian media's muted response to kremlin...", 'inside hospital where oxygen runs out', 'which route will the king take to his', 'watch man 'spanish']
    H4 [EMPTY]
    H5 [EMPTY]
    H6 [EMPTY]
    Title [bbc - homepage]
    URL-Host [bbc, bbc.com]
    URL- Query []

    (6) Feature Score Output


                        +---------------+-----------+------------------------------------------+--------------------+
                  |      Word     | Frequency |                   TAGS                   |    Final-Score     |
                  +---------------+-----------+------------------------------------------+--------------------+
                  |      bbc      |     14    | ['H1', 'H2', 'Title', 'Anchor', 'URL-H'] |        25.8        |
                  |    function   |     10    |                    []                    |        2.0         |
                  |    weather    |     9     |          ['H2', 'H3', 'Anchor']          |        11.8        |
                  |      news     |     8     |             ['H2', 'Anchor']             |        7.6         |
                  |     sport     |     8     |             ['H2', 'Anchor']             |        7.6         |
                  |    business   |     8     |             ['H2', 'Anchor']             |        7.6         |
                  |     return    |     7     |                    []                    | 1.4000000000000001 |
                  |     watch     |     7     |             ['H3', 'Anchor']             |        6.4         |
                  |      home     |     6     |                ['Anchor']                |        2.2         |
                  |     sounds    |     6     |                ['Anchor']                |        2.2         |
                  |     travel    |     6     |                ['Anchor']                |        2.2         |
                  |     future    |     6     |                ['Anchor']                |        2.2         |
                  |    charles    |     6     |          ['H2', 'H3', 'Anchor']          |        11.2        |
                  |     typeof    |     5     |                    []                    |        1.0         |
                  |    worklife   |     5     |                ['Anchor']                |        2.0         |
                  |    culture    |     5     |                ['Anchor']                |        2.0         |
                  |    football   |     5     |             ['H3', 'Anchor']             |        6.0         |
                  |      tech     |     5     |          ['H2', 'H3', 'Anchor']          |        11.0        |
                  |    homepage   |     4     |        ['H1', 'Title', 'Anchor']         |        13.8        |
                  |      reel     |     4     |                ['Anchor']                |        1.8         |
                  |     denies    |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |     europe    |     4     |                ['Anchor']                |        1.8         |
                  |      top      |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |     school    |     4     |             ['H3', 'Anchor']             |        5.8         |
                  | entertainment |     4     |                ['Anchor']                |        1.8         |
                  |      arts     |     4     |                ['Anchor']                |        1.8         |
                  |     years     |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |    william    |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |     makes     |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |      lady     |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |      wife     |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |    russian    |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |     photos    |     4     |             ['H3', 'Anchor']             |        5.8         |
                  |   primitive   |     3     |                    []                    |        0.6         |
                  |     catch     |     3     |                    []                    |        0.6         |
                  |     event     |     3     |                    []                    |        0.6         |
                  |      set      |     3     |                    []                    |        0.6         |
                  | optimizelyurl |     3     |                    []                    |        0.6         |
                  | accessibility |     3     |             ['H2', 'Anchor']             |        6.6         |
                  |    iplayer    |     3     |                ['Anchor']                |        1.6         |
                  |    bitesize   |     3     |                ['Anchor']                |        1.6         |
                  |    cbeebies   |     3     |                ['Anchor']                |        1.6         |
                  |      cbbc     |     3     |                ['Anchor']                |        1.6         |
                  |      food     |     3     |                ['Anchor']                |        1.6         |
                  |       tv      |     3     |                ['Anchor']                |        1.6         |
                  |     attack    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    knowing    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     guilty    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |   seditious   |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |   conspiracy  |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |   copyright   |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     serie     |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     napoli    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    udinese    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |      tour     |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    premier    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    brighton   |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |      iii      |     3     |          ['H2', 'H3', 'Anchor']          |        10.6        |
                  |      soho     |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     royal     |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    diamonds   |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     videos    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |     change    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |   technology  |     3     |             ['H2', 'Anchor']             |        6.6         |
                  |    branson    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    embrace    |     3     |             ['H3', 'Anchor']             |        5.6         |
                  |    science    |     3     |                ['Anchor']                |        1.6         |
                  |   hollywood   |     3     |             ['H3', 'Anchor']             |        5.6         |
                  

    (7) Final Keywords Output


     
                        bbc
                        coronation
                        homepage
                        charles
                        iii
                        global
                        tech
                        news
                        sport
                        business
                       

    Result Drank Ends



    WebRank

In this WebRank section we will discuss four main parts:
    (1) Introduction (2) Methodology: (3) Python Implementation (4) Output Section

    (1) Introduction

    (2) Methodology

Under process.


    (3) Python Implementation

WebRank is implemented in two sections: (1) text words with features and (2) training and testing.

(1) Text words with features

    1. Import packages
    2. Extract text of webpage
    3. Detect language of the text and provide stopwords
    4. Preprocess text
    5. Feature formation
    6. Creating a word list in an Excel file with 10 features

(2) Training and Testing

      1. Bayes test and Bayes train
      2. KNN test and KNN train
      3. MLP test and MLP train
      4. SVM test and SVM train
      5. Decision Tree test and Decision Tree train
      6. Max Scoring
      7. Common
                  
import sys
import re
import math
import lxml
import urllib.request
import numpy as np
import nltk

from collections import defaultdict, Counter
from bs4 import BeautifulSoup
from bs4.element import Comment
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

STP_SET_ENG_NLTK = set(stopwords.words("english"))
F_stopwords = set(stopwords.words("finnish"))
url_unused_words = ['', 'https', 'www', 'com', '-', 'php', 'pk', 'fi', 'https:', 'http', 'http:', 'html', 'htm']
english_stop_words = [x for x in STP_SET_ENG_NLTK]
finnish_stop_words = [x for x in F_stopwords]
combine_stopwords = english_stop_words + finnish_stop_words
    
def Scrapper1(element):
    # Keep only visible text nodes
    if element.parent.name in ['html', 'style', 'script']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def Scrapper2(body):
    # Extract the visible text of an HTML document
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(Scrapper1, texts)
    return u" ".join(t.strip() for t in visible_texts)

def Scrapper3(text):
    # Break into lines, remove leading/trailing space, and drop blank lines
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return u'\n'.join(chunk for chunk in chunks if chunk)

def Scrapper_title_4(urls):
    # Return the page title together with its URL
    req = urllib.request.Request(urls, headers={'User-Agent': "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    title = []
    soup = BeautifulSoup(html, 'lxml')
    title.append(soup.title.string)
    return (title, urls)

def Web(urls):
    # Download a webpage and return its cleaned text and parsed HTML
    req = urllib.request.Request(urls, headers={'User-Agent': "Magic Browser"})
    con = urllib.request.urlopen(req)
    html = con.read()
    soup = BeautifulSoup(html, 'lxml')
    raw = Scrapper2(html)
    clean_text = Scrapper3(raw)
    return (clean_text, soup)
# Detect language and build the corresponding stopword list
def _calculate_languages_ratios(text):
    languages_ratios = {}
    tokens = wordpunct_tokenize(text)
    words = [word.lower() for word in tokens]
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)   # language "score"
    return languages_ratios

def detect_language(text):
    ratios = _calculate_languages_ratios(text)
    most_rated_language = max(ratios, key=ratios.get)
    stop_words_for_language = set(stopwords.words(most_rated_language))
    return most_rated_language, stop_words_for_language

def extract_stop_words(detected_language):
    # detected_language is the (name, stopword set) pair returned by detect_language
    stop_words = []
    language_name = detected_language[0]
    for x in detected_language:
        for i in x:
            stop_words.append(i)
    return (language_name, stop_words)
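A minimal usage sketch of the detector above (assuming the NLTK stopword corpora are downloaded; the sample sentence is just an illustration):

sample = "Keywords make it easier to understand the meaning of a text in fewer words."
language, language_stopwords = detect_language(sample)
print(language)                 # expected: 'english'
print(len(language_stopwords))  # size of the NLTK stopword list for that language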
# Preprocessing
def Text_Clean(Text, stopwords):
    clean_text = []
    filter_text = [x.lower().strip()
                    .replace('.', '').replace('‘', '').replace('"', '').replace('\'', '')
                    .replace('?', '').replace(',', '').replace('-', '').replace(':', '')
                    .replace('!', '').replace('@', '').replace(')', '').replace('(', '')
                    .replace('#', '').replace('%', '').replace('/', '').replace('\\', '')
                    .replace('~', '').replace('’', '').replace('”', '').replace(';', '')
                    .replace('–', '').replace('/n', '').replace('\n', '').replace('…', '')
                    .replace('“', '').strip()
                   for x in Text.split()]
    for word in filter_text:
        [clean_text.append(x) for x in word.split()
         if x not in stopwords and len(x) > 1 and x.isalpha()]
    return clean_text

# Feature formation
def Divide_Url(url):
    from urllib.parse import urlparse
    host = []
    obj = urlparse(url)
    name = obj.hostname
    for x in name.split('.'):
        if x.lower() not in url_unused_words:
            host.append(x)
    return host

def Divide_URL_HOST_QUERY(URL):
    path = []
    host = Divide_Url(URL)
    for x in URL.split('/'):
        for i in x.split('.'):
            for d in i.split('-'):
                if d.lower() not in url_unused_words and d.lower() not in host:
                    path.append(d.lower())
    host_dic = COUNTER_DICT(host)
    path_dic = COUNTER_DICT(path)
    return (host_dic, path_dic)

def get_text(soup, h):
    # Relative word frequencies of the text inside tag h
    text = []
    text2 = []
    text_dic = {}
    for w in soup.find_all(h):
        h_text = w.text.strip()
        h_text = h_text.replace(':', '').replace(',', '').replace('|', '')
        h_text = h_text.lower()
        for x in h_text.split('-'):
            text.append(x)
    if len(text) != 0:
        for x in text:
            word = [n.strip() for n in x.split(',')]
            for x in word:
                words = [i.strip() for i in x.split()]
                for x in words:
                    text2.append(x)
        text_dic = COUNTER_DICT(text2)
        return text_dic
    else:
        return text_dic

def CHEK_NULL(word, dic):
    # Return the feature value of word, or 0 when the feature dictionary is empty
    f = 0
    if len(dic) >= 1:
        f = dic.get(word)
    else:
        f = 0
    if f is None:
        f = 0
    return f

def Extract_headerAnchorTitle(soup):
    h1_d = get_text(soup, 'h1')
    h2_d = get_text(soup, 'h2')
    h3_d = get_text(soup, 'h3')
    h4_d = get_text(soup, 'h4')
    h5_d = get_text(soup, 'h5')
    h6_d = get_text(soup, 'h6')
    a_d = get_text(soup, 'a')        # anchor text
    title_d = get_text(soup, 'title')
    return (h1_d, h2_d, h3_d, h4_d, h5_d, h6_d, a_d, title_d)
# Manual score for each feature
def GET_SCORE_EACH_FEATURE(word, h1_dic, h2_dic, h3_dic, h4_dic, h5_dic, h6_dic,
                           A_dic, title_dic, URL_H_dic, URL_Q_dic):
    f1 = CHEK_NULL(word, h1_dic)
    f2 = CHEK_NULL(word, h2_dic)
    f3 = CHEK_NULL(word, h3_dic)
    f4 = CHEK_NULL(word, h4_dic)
    f5 = CHEK_NULL(word, h5_dic)
    f6 = CHEK_NULL(word, h6_dic)
    f7 = CHEK_NULL(word, A_dic)
    f8 = CHEK_NULL(word, title_dic)
    f9 = CHEK_NULL(word, URL_H_dic)
    f10 = CHEK_NULL(word, URL_Q_dic)
    return (f1, f2, f3, f4, f5, f6, f7, f8, f9, f10)
def COUNTER_DICT(list_words):
    # Relative frequency of each word: count divided by the number of unique words
    score_dic = {}
    if len(list_words) >= 1:
        list_words = [x for x in list_words
                      if x not in combine_stopwords and len(str(x)) > 1 and str(x).isalpha()]
        unique_list = []
        [unique_list.append(x) for x in list_words if x not in unique_list]
        lngth_list = len(unique_list)
        counter_list = Counter(list_words)
        word_fr_dic = {}
        for word, fr in counter_list.most_common():
            word_fr_dic[word] = fr
        for word in unique_list:
            fr = word_fr_dic.get(word)
            fr_word = fr / lngth_list
            score_dic[word] = fr_word
        return score_dic
    else:
        return ()
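For instance (a toy input, assuming the NLTK stopword lists are available), COUNTER_DICT divides each count by the number of unique words:

print(COUNTER_DICT(["news", "news", "sport"]))   # {'news': 1.0, 'sport': 0.5}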
def WebRank(URL):
    Text, HTML = Web(URL)
    detected_language = detect_language(Text)
    name, stop_words = extract_stop_words(detected_language)

    candidate_list = Text_Clean(Text, stop_words)
    candidate_dic = COUNTER_DICT(candidate_list)
    unique_candidate_list = []
    [unique_candidate_list.append(x) for x in candidate_list
     if x not in unique_candidate_list
     if x not in STP_SET_ENG_NLTK and x not in stop_words and len(x) > 1 and x.isalpha()]

    # Features
    URL_H_dic, URL_Q_dic = Divide_URL_HOST_QUERY(URL)
    (h1_dic, h2_dic, h3_dic, h4_dic, h5_dic, h6_dic,
     A_dic, title_dic) = Extract_headerAnchorTitle(HTML)

    # Column headers of the feature file
    string = "Word,Relative Frequency %,H1%,H2%,H3%,H4%,H5%,H6%,Anchor%,Title%,Url-Host,Url-Query,GT,web-id"
    for word in unique_candidate_list:
        try:
            fr = candidate_dic.get(word)
            if fr is None or not fr:
                fr = 0
            (f1, f2, f3, f4, f5, f6, f7, f8, f9, f10) = GET_SCORE_EACH_FEATURE(
                word, h1_dic, h2_dic, h3_dic, h4_dic, h5_dic, h6_dic,
                A_dic, title_dic, URL_H_dic, URL_Q_dic)
            f11 = 0   # ground-truth placeholder
            f12 = 0   # webpage id placeholder
            string += "\n" + word + ","
            string += str(fr) + ","
            string += str(f1) + ","
            string += str(f2) + ","
            string += str(f3) + ","
            string += str(f4) + ","
            string += str(f5) + ","
            string += str(f6) + ","
            string += str(f7) + ","
            string += str(f8) + ","
            string += str(f9) + ","
            string += str(f10) + ","
            string += str(f11) + ","
            string += str(f12)
        except:
            continue
    return (string, Text)

if __name__ == "__main__":
    URL = "http://bbc.com"
    word__plus_featuresScore, Text_webpage = WebRank(URL)

(2) Features Training and Testing Section

    (1) Common.py

import math

def readData(filename):
    # Read a whitespace-delimited file: feature columns, then label, then webpage id
    from numpy import genfromtxt
    import numpy as np
    data = genfromtxt(filename, delimiter=' ')
    labels = []
    webpageIndex = []
    features = []
    if type(data[0]) is np.float64:
        # The file contains a single row
        features.append(data[:-2])
        labels.append(data[-2])
        webpageIndex.append(data[-1])
    else:
        for i in range(0, len(data)):
            features.append(data[i][:-2])
            labels.append(data[i][-2])
            webpageIndex.append(data[i][-1])
    return {"features": features, "labels": labels, "webpageIndices": webpageIndex}

def printStatistics(predicted, labels, indices):
    import numpy as np
    indicesSet = set(indices)
    resultString = ""
    oldIndex = -1
    keywordIndices = ""
    binaryValues = ""
    for i in range(0, len(indices)):
        binaryValues = binaryValues + str(math.floor(predicted[i])) + "\n"
        if oldIndex != indices[i]:
            # webpage id is indices[i]
            if oldIndex != -1:
                resultString = resultString + str(math.floor(oldIndex)) + " " + keywordIndices + "\n"
            if predicted[i] == 1:
                if keywordIndices != "":
                    keywordIndices += " "
                keywordIndices += str(i)
            if oldIndex != -1:
                if predicted[i] == 1:
                    keywordIndices = str(i)
                else:
                    keywordIndices = ""
            oldIndex = indices[i]
        else:
            if predicted[i] == 1:
                if keywordIndices != "":
                    keywordIndices += " "
                keywordIndices += str(i)
    resultString = resultString + str(math.floor(oldIndex)) + " " + keywordIndices + "\n"
    return [resultString, binaryValues]

def save(model, filename):
    # Note: recent scikit-learn versions no longer ship sklearn.externals.joblib;
    # there, use "import joblib" directly instead.
    from sklearn.externals import joblib
    joblib.dump(model, filename)

def load(filename):
    from sklearn.externals import joblib
    return joblib.load(filename)

def getHighestProbabilities(scores, indices, top):
    # Mark the top-scoring words of each webpage as predicted keywords
    import numpy as np
    indicesSet = set(indices)
    predicted = np.zeros(len(scores))
    for val in indicesSet:
        scoresInSet = []
        minIndex = len(indices)
        for i in range(0, len(indices)):
            if indices[i] == val:
                scoresInSet.append(scores[i])
                minIndex = min(minIndex, i)
        ids = np.argsort(scoresInSet)
        marksInSet = np.zeros(len(scoresInSet))
        adjustedTop = min(top, len(scoresInSet))
        for i in range(0, len(marksInSet)):
            if i >= len(marksInSet) - adjustedTop:
                marksInSet[ids[i]] = 1
        for i in range(0, len(indices)):
            if indices[i] == val:
                if marksInSet[i - minIndex] == 1:
                    predicted[i] = 1
    return predicted

    (2) Max_Score.py

    # Predicts the top k(10) keywords
    # 3.7.2019: RM - Implemented
    import common
    import sys

    def scoreEachWebpageKeywords(features,indices,top):
        import numpy as np
        predicted = np.zeros(len(features));
        scores=np.zeros(len(features));
        for i in range (0,len(features)):
            # Himat's scoring Drank method
            tfscore=0.5*features[i][0]
            if(features[i][11]>50):
                tfscore=0.2*features[i][0]
            scores[i]= min(1,features[i][1]) *6
            scores[i]+=min(1,features[i][2]) *5
            scores[i]+=min(1,features[i][3]) *3
            scores[i]+=min(1,features[i][4]) *2
            scores[i]+=min(1,features[i][5]) *2
            scores[i]+=min(1,features[i][6]) *2
            #anchor
            scores[i]+=min(1,features[i][7]) *1
            #title
            scores[i]+=min(1,features[i][8]) *6;
            #url Host
            scores[i]+=min(1,features[i][9]) *5;
            #url Query
            scores[i]+=min(1,features[i][10])*4;
            scores[i]+= tfscore *1;
        predicted=common.getHighestProbabilities(scores,indices,top);
        return predicted;

    # PARAMETERS
    fold=1;
    dataset="mopsi_services"; #guardian,macworld,mopsi_services
    top=10;
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);

    # PREDICTING
    predicted=scoreEachWebpageKeywords(data["features"],data["webpageIndices"],top)

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(predicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();
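    In the scoring above, min(1, .) caps each feature so it adds at most its weight (H1 and Title weigh 6, H2 and Url-Host 5, Url-Query 4, and so on), while the term-frequency contribution is dampened on longer pages (feature 11 is the text length). A small worked example with invented feature values shows how a single row is scored:

        # Worked toy example of the weighting used in scoreEachWebpageKeywords.
        # Feature order (the word itself already dropped): TF, H1..H6, Anchor,
        # Title, Url-Host, Url-Query, Text-Length; all values below are invented.
        row = [3, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 120]
        tfscore = 0.2*row[0] if row[11] > 50 else 0.5*row[0]          # 0.2*3 = 0.6
        score = (min(1,row[1])*6 + min(1,row[2])*5 + min(1,row[3])*3
                 + min(1,row[4])*2 + min(1,row[5])*2 + min(1,row[6])*2
                 + min(1,row[7])*1 + min(1,row[8])*6 + min(1,row[9])*5
                 + min(1,row[10])*4 + tfscore)
        print(score)  # 6 + 1 + 6 + 5 + 0.6 = 18.6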
    
        

    (3) KNN Training and Testing

    (3.1) KNN Training

    # Trains a KNN model and stores it in a file
    # Number of neighbors is a parameter
    import common
    import sys
    from sklearn.neighbors import KNeighborsClassifier

    # PARAMETERS
    inFile="../training.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/knn.joblib";
    k=2; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        k=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA READING
    data=common.readData(inFile);

    # TRAINING
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(data["features"], data["labels"])

    # SAVING MODEL
    common.save(model,outFile);
    print ("Model saved at %s" % (outFile));

    (3.2) KNN Test

    # Reads a KNN model from a file and performs prediction
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

    # PARAMETERS
    inFile="../testing.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/knn.joblib";
    top=10; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);
    model=common.load(outFile);

    # PREDICTING
    predicted=model.predict(data["features"])
    probabilities=model.predict_proba(data["features"])
    import numpy as np
    scores=np.zeros(len(probabilities));
    for i in range (0,len(probabilities)):
        scores[i]=probabilities[i][1];
    newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();
    
                      

    (4) Bayes train and Test

    (4.1) Bayes Train

    # Trains a Bayes model and stores it in a file
    # Bayes Train
    import common
    import sys
    from sklearn.naive_bayes import GaussianNB

    # PARAMETERS
    output="../output/";
    input="../input/";
    inFile="train.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/bayes.joblib";
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA READING
    data=common.readData(inFile);

    # TRAINING
    model = GaussianNB()
    model.fit(data["features"][:-1], data["labels"][:-1])

    # SAVING MODEL
    common.save(model,outFile);
    print ("Model saved at %s" % (outFile));

    (4.2) Bayes Test

    # Reads a Bayes model from a file and performs prediction
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

    # PARAMETERS
    inFile="../testing.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/bayes.joblib";
    top=10; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);
    model=common.load(outFile);

    # PREDICTING
    predicted=model.predict(data["features"])
    probabilities=model.predict_proba(data["features"])
    import numpy as np
    scores=np.zeros(len(probabilities));
    for i in range (0,len(probabilities)):
        scores[i]=probabilities[i][1];
    newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();

    (5) MLP training and Testing

    (5.1) MLP Training

    # MLP train and Test
    # Trains a MLP model and stores it in a file
    import common
    import sys
    from sklearn.neural_network import MLPClassifier

    # PARAMETERS
    output="../output/";
    input="../input/";
    inFile="train.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/mlp.joblib";
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";
    L1=15;
    L2=15;

    # DATA READING
    data=common.readData(inFile);

    # TRAINING
    model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(L1, L2), random_state=1)
    model.fit(data["features"][:-1], data["labels"][:-1])

    # SAVING MODEL
    common.save(model,outFile);
    print ("Model saved at %s" % (outFile));

    (5.2) MLP Testing

    # Reads a MLP model from a file and performs prediction
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV

    # PARAMETERS
    inFile="../testing.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/mlp.joblib";
    top=10; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);
    model=common.load(outFile);

    # PREDICTING
    predicted=model.predict(data["features"])
    probabilities=model.predict_proba(data["features"])
    import numpy as np
    scores=np.zeros(len(probabilities));
    for i in range (0,len(probabilities)):
        scores[i]=probabilities[i][1];
    newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();

    (6) SVM

    (6.1) SVM train

    # Trains an SVM model and stores it in a file
    # grid optimization of parameters can be done
    # 3.7.2019: RM - Implemented
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV

    # worry about this later
    def getOptimizedParameters(features, labels):
        import numpy as np
        C_range = np.logspace(-2, 10, 13)
        gamma_range = np.logspace(-9, 3, 13)
        param_grid = dict(gamma=gamma_range, C=C_range)
        cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
        grid = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=cv)
        grid.fit(features[:-1], labels[:-1])
        return grid;

    # leave it to 0 for now or study SVM parameter optimization
    optimize=0;

    # PARAMETERS
    inFile="../training.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/svm.joblib";
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA READING
    data=common.readData(inFile);

    # OPTIMIZATION
    if(optimize==1):
        print ("Optimizing parameters");
        parameters=getOptimizedParameters(data["features"],data["labels"]);
        C=parameters.best_params_["C"];
        gamma=parameters.best_params_["gamma"];
        score=parameters.best_score_;
        print("The best parameters are %s with a score of %0.2f" % (parameters.best_params_, parameters.best_score_))
    else:
        # defaults
        C= 10.0;
        gamma=1000.0;

    # TRAINING
    model = svm.SVC(gamma=gamma, C=C, probability=True)
    model.fit(data["features"], data["labels"])

    # SAVING MODEL
    common.save(model,outFile);
    print ("Model saved at %s" % (outFile));

    (6.2) SVM test

    # Reads an SVM model from a file and performs prediction
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV

    # PARAMETERS
    inFile="../testing.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/svm.joblib";
    top=10; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);
    model=common.load(outFile);

    # PREDICTING
    predicted=model.predict(data["features"])
    probabilities=model.predict_proba(data["features"])
    import numpy as np
    scores=np.zeros(len(probabilities));
    for i in range (0,len(probabilities)):
        scores[i]=probabilities[i][1];
    newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();
    
                        

    (7) Decision Tree (DT)

    (7.1) DT Train

    # Trains a Decision Tree model and stores it in a file
    # 15.11.2018: RM - Implemented
    # 3.12.2019: RM - Updated for Keywords
    import common
    import sys
    from sklearn import tree
    from sklearn.tree import export_graphviz
    import pydotplus

    # PARAMETERS
    output="../output/";
    input="../input/";
    inFile="train.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/dtree.joblib";
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA READING
    data=common.readData(inFile);

    # TRAINING
    model = tree.DecisionTreeClassifier()
    model.fit(data["features"][:-1], data["labels"][:-1])
    # visualize the learned tree
    dot_data = export_graphviz(model, out_file=None, filled=True, rounded=True)
    pydot_graph = pydotplus.graph_from_dot_data(dot_data)
    pydot_graph.write_png('original_tree.png')
    pydot_graph.set_size('"5,5!"')
    pydot_graph.write_png('resized_tree.png')

    # SAVING MODEL
    common.save(model,outFile);
    print ("Model saved at %s" % (outFile));

    (7.2) DT test

    # Reads a Decision Tree model from a file and performs prediction
    import common
    import sys
    from sklearn import svm
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GridSearchCV

    # PARAMETERS
    output="../output/";
    input="../input/";
    inFile="test.txt";
    outFile="/home/tko/himat/web-docs/machine_learning/classification/models/dtree.joblib";
    top=10; #default
    if(len(sys.argv)>1):
        dataset=sys.argv[1];
    if(len(sys.argv)>2):
        fold=int(sys.argv[2]);
    if(len(sys.argv)>3):
        top=int(sys.argv[3]);
    inFile="/home/tko/himat/web-docs/machine_learning/txt_files_datasets/"+dataset+"/testing_"+str(fold)+".txt";

    # DATA + MODEL READING
    data=common.readData(inFile);
    model=common.load(outFile);

    # PREDICTING
    predicted=model.predict(data["features"])
    probabilities=model.predict_proba(data["features"])
    import numpy as np
    scores=np.zeros(len(probabilities));
    for i in range (0,len(probabilities)):
        scores[i]=probabilities[i][1];
    newPredicted=common.getHighestProbabilities(scores,data["webpageIndices"],top);

    # OUTPUT STATISTICS
    [resultString,binaryValues]=common.printStatistics(newPredicted,data["labels"],data["webpageIndices"]);
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/testing_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(resultString);
    f.close();
    outFile="/home/tko/himat/web-docs/machine_learning/classification/output/"+dataset+"/binary_"+str(fold)+".txt";
    f= open(outFile,"w+");
    f.write(binaryValues);
    f.close();

    (8) Generate Keywords from training and testing files

    
                        #php starts	
                        $datasets=["guardian","macworld","mopsi_services"];//guardian,mopsi_services,macworld
                        $tops=[10,10,5];//guardian,mopsi_services,macworld
                        
                        $outputDir = "/home/tko/himat/web-docs/machine_learning/classification/output";
                        $inputDir = "/home/tko/himat/web-docs/machine_learning/txt_files_datasets";
                        $classifiersDir = "/home/tko/himat/web-docs/machine_learning/classification/classifiers/";
                        
                        $classificationMethod="max_scoring.py";// means Drank
                        $classificationMethod="knn";
                        $neighbors=12;
                        $classificationMethod="dtree";
                        /*$classificationMethod="bayes";
                        $classificationMethod="mlp";
                        $classificationMethod="svm";*/
                        
                        for($d=0;$d< count($datasets);$d++){
                          $dataset=$datasets[$d]; 
                          $allResults=[];
                          for($k=1;$k<=5;$k++){
                            
                            if($classificationMethod=="knn"){
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k." ".$neighbors);
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
                            }else if($classificationMethod=="dtree"){
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
                            }else if($classificationMethod=="bayes"){
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
                            }else if($classificationMethod=="mlp"){
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
                            }else if($classificationMethod=="svm"){
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_train.py ".$dataset." ".$k);
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod."_test.py ".$dataset." ".$k." ".$tops[$d]);
                            }else{ //drank
                              $output = shell_exec('python3 '.$classifiersDir.$classificationMethod." ".$dataset." ".$k." ".$tops[$d]);
                            }
                            $results=readResults($outputDir."/".$dataset."/testing_".$k.".txt");
                            $keywords=readKeywords($inputDir."/".$dataset."/testing_kw_".$k.".txt"); // reads keywords from separate file
                            for($i=0;$i< count($results);$i++){
                              if($i>0){
                                $str.="\n";
                              }
                              for($j=0;$j< count($results[$i]);$j++){
                                if($j>0){
                                $str.=" ";
                              }
                              $str.=$results[$i][$j];
                            }
                          }
                          return $str;
                        }
                        
                      #PHP ends
    

    (9) Database setup

    #-php program
    #datasetup.py (Generate the testing and training data in txt format in txt_files_datasets)
    #PHP starts
    // here we prepare training and testing data from raw information
    // 3.7.2019 : RM - Implemented
    // 28.10.2019: RM - updating to output keywords in separate files
    /*
     input file format:
     Word,TF,H1,H2,H3,H4,H5,H6,Anchor,Title,URL-1,Url-2,Text Length,GT,webpag_id
      0   - Word: the word appearing in the text of the webpage
      1   - TF: term frequency, i.e. how many times the word appears
      2:7 - H1-H6: header tags
      8   - Anchor: text appearing inside an anchor tag
      9   - Title: title tag
      10  - Url-1: host (main) part of the URL
      11  - Url-2: query part of the URL after the /
      12  - Text Length: total number of words inside the webpage
      13  - GT: ground truth (1 if the word matches an assigned keyword)
      14  - Webpage_id: unique id for each webpage
    */
    // file is big, need extra memory
    ini_set('memory_limit', '15192M');
    // read input
    $inputFileName="csv_files/mopsi_services_312.csv"; //mopsi_services_414.csv,macworld_220.csv,guardian_412.csv
    $outputDirectory="mopsi_services"; //mopsi_services,macworld,guardian
    $myfile = fopen($inputFileName, "r") or die("Unable to open file!");
    $contents = fread($myfile,filesize($inputFileName));
    fclose($myfile);
    // dividing into lines
    $lines=explode("\n",$contents);
    echo count($lines)." lines\n";
    // grouping into webpages
    $pages=[];
    $page=[];
    $webId=-1;
    for($i=0;$i< count($lines);$i++){
      $comp=explode(",",$lines[$i]);
      $webpag_id=trim($comp[14]);
      if($webpag_id!="" && $webpag_id!="webpag_id" && $webpag_id!="web-id"){ /// HIMAT FIX
        if($webId!=$webpag_id){
          $webId=$webpag_id;
          if(count($page)>0){
            array_push($pages,$page);
          }
          $page=[];
        }
        array_push($page,$lines[$i]);
      }else{
        // ignoring a repeated header or an empty line
      }
    }
    if(count($page)>0){
      array_push($pages,$page);
    }
    echo count($pages)." pages\n";
    // dividing into train / test
    $testingPercent=0.2;
    for($k=0;$k< 5;$k++){
      $training=[];
      $testing=[];
      $lowThreshold=floor($testingPercent*$k*count($pages));
      $highThreshold=floor($testingPercent*($k+1)*count($pages));
      for($i=0;$i< count($pages);$i++){
        if($i>=$lowThreshold && $i<$highThreshold){
          array_push($testing,$pages[$i]);
        }else{
          array_push($training,$pages[$i]);
        }
      }
      //echo count($training)." training\n";
      //echo count($testing)." testing\n";
      // generate the feature vector files
      $trainingFileName="txt_files_datasets/$outputDirectory/training_".($k+1).".txt";
      $myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatFileContents($training));
      fclose($myfile);
      $testingFileName="txt_files_datasets/$outputDirectory/testing_".($k+1).".txt";
      $myfile = fopen($testingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatFileContents($testing));
      fclose($myfile);
      // generate the keyword mapping files
      $trainingFileName="txt_files_datasets/$outputDirectory/training_kw_".($k+1).".txt";
      $myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatKeywordFileContents($training));
      fclose($myfile);
      $testingFileName="txt_files_datasets/$outputDirectory/testing_kw_".($k+1).".txt";
      $myfile = fopen($testingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatKeywordFileContents($testing));
      fclose($myfile);
    }
    function formatFileContents($pages){
      $string="";
      for($i=0;$i< count($pages);$i++){
        for($j=0;$j< count($pages[$i]);$j++){
          $line=$pages[$i][$j];
          // generating feature vector
          $comp=explode(",",$line);
          $vector=[];
          for($k=1;$k< count($comp);$k++){ // no need for the word itself, so starting at 1
            array_push($vector,trim($comp[$k]));
          }
          for($k=0;$k< count($vector);$k++){
            $string.=$vector[$k];
            if($k< count($vector)-1){
              $string.=" ";
            }else{
              $string.="\n";
            }
          }
        }
      }
      return $string;
    }
    function formatKeywordFileContents($pages){
      $string="";
      for($i=0;$i< count($pages);$i++){
        for($j=0;$j< count($pages[$i]);$j++){
          $line=$pages[$i][$j];
          // keeping only the word itself
          $comp=explode(",",$line);
          $vector=[];
          array_push($vector,trim($comp[0]));
          for($k=0;$k< count($vector);$k++){
            $string.=$vector[$k];
            if($k< count($vector)-1){
              $string.=" ";
            }else{
              $string.="\n";
            }
          }
        }
      }
      return $string;
    }
    #PHP ends
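    The feature files written by formatFileContents are exactly what common.readData in the classification scripts expects: the word itself is dropped, the remaining values are space separated, and the last two columns are the ground-truth label and the webpage id. A minimal sketch with an invented row shows the mapping:

        # How one output line maps to features, label and webpage id (invented values).
        line = "3 1 0 0 0 0 0 1 1 1 0 120 1 42"
        values = [float(x) for x in line.split(" ")]
        features, label, webpage_id = values[:-2], values[-2], values[-1]
        print(features, label, webpage_id)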

    (10) Test PHP file

    #PHP starts
    // here we prepare training and testing data from raw information (all three datasets combined)
    // 3.7.2019 : RM - Implemented
    // 28.10.2019: RM - updating to output keywords in separate files
    /*
     input file format:
     Word,TF,H1,H2,H3,H4,H5,H6,Anchor,Title,URL-1,Url-2,Text Length,GT,webpag_id
     (same column layout as described in the datasetup script in section (9))
    */
    // file is big, need extra memory
    ini_set('memory_limit', '15192M');
    // read input //mopsi_services_414.csv,macworld_204.csv,guardian_412.csv
    $outputDirectory="combined";
    $inputFileName="csv_files/guardian_402.csv";
    $myfile = fopen($inputFileName, "r") or die("Unable to open file!");
    $contents = fread($myfile,filesize($inputFileName));
    fclose($myfile);
    // dividing into lines
    $lines1=explode("\n",$contents);
    echo count($lines1)." lines\n";
    $inputFileName="csv_files/macworld_204.csv";
    $myfile = fopen($inputFileName, "r") or die("Unable to open file!");
    $contents = fread($myfile,filesize($inputFileName));
    fclose($myfile);
    // dividing into lines
    $lines2=explode("\n",$contents);
    echo count($lines2)." lines\n";
    $inputFileName="csv_files/mopsi_services_312.csv";
    $myfile = fopen($inputFileName, "r") or die("Unable to open file!");
    $contents = fread($myfile,filesize($inputFileName));
    fclose($myfile);
    // dividing into lines
    $lines3=explode("\n",$contents);
    echo count($lines3)." lines\n";
    $lines=[];
    for($i=0;$i< count($lines1);$i++){
      array_push($lines,$lines1[$i]);
    }
    for($i=0;$i< count($lines2);$i++){
      array_push($lines,$lines2[$i]);
    }
    for($i=0;$i< count($lines3);$i++){
      array_push($lines,$lines3[$i]);
    }
    // grouping into webpages
    $pages=[];
    $page=[];
    $webId=-1;
    for($i=0;$i< count($lines);$i++){
      $comp=explode(",",$lines[$i]);
      $webpag_id=trim($comp[14]);
      if($webpag_id!="" && $webpag_id!="webpag_id" && $webpag_id!="web-id"){
        if($webId!=$webpag_id){
          $webId=$webpag_id;
          if(count($page)>0){
            array_push($pages,$page);
          }
          $page=[];
        }
        array_push($page,$lines[$i]);
      }else{
        // ignoring a repeated header or an empty line
      }
    }
    if(count($page)>0){
      array_push($pages,$page);
    }
    echo count($pages)." pages\n";
    // dividing into train / test
    $testingPercent=0.2;
    for($k=0;$k< 5;$k++){
      $training=[];
      $testing=[];
      $lowThreshold=floor($testingPercent*$k*count($pages));
      $highThreshold=floor($testingPercent*($k+1)*count($pages));
      for($i=0;$i< count($pages);$i++){
        if($i>=$lowThreshold && $i<$highThreshold){
          array_push($testing,$pages[$i]);
        }else{
          array_push($training,$pages[$i]);
        }
      }
      //echo count($training)." training\n";
      //echo count($testing)." testing\n";
      // generate the feature vector files
      $trainingFileName="txt_files_datasets/$outputDirectory/training_".($k+1).".txt";
      $myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatFileContents($training));
      fclose($myfile);
      $testingFileName="txt_files_datasets/$outputDirectory/testing_".($k+1).".txt";
      $myfile = fopen($testingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatFileContents($testing));
      fclose($myfile);
      // generate the keyword mapping files
      $trainingFileName="txt_files_datasets/$outputDirectory/training_kw_".($k+1).".txt";
      $myfile = fopen($trainingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatKeywordFileContents($training));
      fclose($myfile);
      $testingFileName="txt_files_datasets/$outputDirectory/testing_kw_".($k+1).".txt";
      $myfile = fopen($testingFileName, "w") or die("Unable to open file!");
      fwrite($myfile, formatKeywordFileContents($testing));
      fclose($myfile);
    }
    // formatFileContents() and formatKeywordFileContents() are identical to the ones
    // defined in the datasetup script in section (9)
    #PHP ends

    # removes the non-existing files without tags
    # rename the files
    # 220 to 204 reduced
    # change the numbers into binary files
    import dranks as D
    from nltk.corpus import stopwords
    import re
    import warnings
    warnings.filterwarnings("ignore")
    import csv
    import requests
    from bs4 import BeautifulSoup
    from flask import Flask
    from collections import defaultdict, Counter

    stp = "january use jun jan feb mar apr may jul agust dec oct nov sep dec product continue one two three four five please thanks find helpful week job experience women girl apology read show eve knowledge benefit appointment street way staff salon discount gift cost thing world close party love letters rewards offers special close page week dollars voucher gifts vouchers welcome therefore march nights need name pleasure show sisters thank menu today always time needs welcome march february april may june jully aguast september october november december day year month minute second secodns".split(" ")
    common_nouns = 'debt est dec big than who of com offer sale the in fi'.split(" ")
    stps = set(stopwords.words("english"))
    spchars = re.compile('\`|\~|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\_|\+|\=|\\|\||\{|\[|\]|\}|\:|\;|\'|\"|\<|\,|\>|\?|\/|\.|\- ')
    ###########################################################################################################################################
    spchars = re.compile('\`|\~|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\_|\+|\=|\\|\||\{|\[|\]|\}|\:|\;|\'|\"|\<|\,|\>|\?|\/|\.|\- ’')
    stp = "january use jun jan feb mar apr may jul agust dec free oct nov sep dec product continue one two three four five please thanks find helpful week job experience women girl apology read show eve knowledge benefit appointment street way staff salon discount gift cost thing world close party love letters rewards offers special close page week dollars voucher gifts vouchers welcome therefore march nights need name pleasure show sisters thank menu today always time needs welcome march february april may june jully aguast september october november december day year month minute second secodns".split(" ")
    common_nouns = ['debt', 'est', 'dec', 'big', 'than', 'who', 'one', 'two', 'three', 'of', 'four', 'five', 'al sisi', 'free', 'gift', 'voucher', 'vouchers', 'try', 'best buy buying']
    common_words = ['of', 'for', 'the', 'www', 'fi', 'com', 'free', 'try', 'best']
    name_of_file = ['germon_stopwords_file', 'finnish_stopwords_file', 'english_stpwords_list']
    germon_stopwords_file = D.Stop_Words_list(name_of_file[0])
    finnish_stopwords_file = D.Stop_Words_list(name_of_file[1])
    english_stpwords = D.Stop_Words_list(name_of_file[2])
    stop_word_list = [english_stpwords, finnish_stopwords_file, germon_stopwords_file]

    ##################################################################################################
    def Clean_text(text):
        Words = []
        for word in text.split():
            word = word.replace("’", ' ')
            word = word.lower()
            word = spchars.sub(" ", word.strip())
            if word not in stps:
                if word not in stp:
                    if len(word) > 1:
                        if word != " ":
                            if word not in common_nouns:
                                if not word.isdigit():
                                    if word not in stop_word_list[0]:
                                        for x in word.split():
                                            if x not in stp and x not in stps and len(x) > 1 and x not in common_nouns and x not in common_words:
                                                x = x.strip()
                                                if not x[0].isdigit():
                                                    Words.append(x)
        return (Words)

    def Remove_duplicates_GT(l):
        new_list = []
        [new_list.append(x) for x in l if x not in new_list]
        return (new_list)

    def Clean_CSV_Files(words):
        w = []
        import string
        from string import digits
        exclude = set(string.punctuation)
        if type(words) is not list:
            words = [(spchars.sub(" ", i)).replace('\n', '').strip().lstrip(digits) for i in words.split(',') if i not in exclude]
        else:
            words = [(spchars.sub(" ", i)).replace('\n', '').strip().lstrip(digits) for i in words if i not in exclude]
        for x in words:
            if len(x) > 0:
                w.append(x)
        return (w)

    def Tf_Score(fr, text_length):
        if text_length < 50:
            tf_score = ((fr/100)*50)
        else:
            tf_score = ((fr/100)*20)
        return (tf_score)

    def Feature_Score(candidate_word, feature_words, score):
        total_score = 0
        score_single_time = 0
        for word_feature in feature_words:
            if word_feature == candidate_word:
                #total_score += score
                score_single_time = score
        return (score_single_time)

    ###############################################################################################
    def Get_guardian_url_words_list(root):
        Score = defaultdict()
        csv.register_dialect('myDialect', delimiter=',', quoting=csv.QUOTE_NONE, skipinitialspace=True)
        stop_word_list = [english_stpwords, finnish_stopwords_file, germon_stopwords_file]
        with open('guardian_402.csv', 'a', encoding='utf-8') as f:
            writer = csv.writer(f, dialect='myDialect')
            writer.writerow(['Word','TF','H1','H2','H3','H4','H5','H6','Anchor','Title','Url-Host','Url-Query','Txt-Lngth','GT','web-id'])
            for web_id in range(402):
                try:
                    files = root + str(web_id) + '/tags.txt'
                    urls, GT = D.Read_Txt(files)  # separates the ground truth and the url
                    url = str(urls)
                    text, HTML = D.Web(url)
                    H1, H2, H3, H4, H5, H6, anchor, title = D.Extract_headerAnchorTitle(HTML)
                    url_host, url_query = D.Urls(url)
                    words = Clean_text(text)
                    text_length = len(words)
                    words_and_freq = D.Calc_word_frequency(words)
                    score = []
                    in_gt = 0
                    for word, fr in words_and_freq.items():
                        score_h1 = Feature_Score(word, H1, 1)               #6
                        score_h2 = Feature_Score(word, H2, 1)               #5
                        score_h3 = Feature_Score(word, H3, 1)               #3
                        score_h4 = Feature_Score(word, H4, 1)               #2
                        score_h5 = Feature_Score(word, H5, 1)               #2
                        score_h6 = Feature_Score(word, H6, 1)               #2
                        score_anchor = Feature_Score(word, anchor, 1)       #1
                        score_title = Feature_Score(word, title, 1)         #6
                        score_url_host = Feature_Score(word, url_host, 1)   #5
                        score_url_query = Feature_Score(word, url_query, 1) #4
                        tf_score = Tf_Score(fr, text_length)
                        if word not in GT:
                            in_gt = 0
                        else:
                            in_gt = 1
                        writer.writerow([word,str(fr),str(score_h1),str(score_h2),str(score_h3),str(score_h4),str(score_h5),str(score_h6),str(score_anchor),str(score_title),str(score_url_host),str(score_url_query),str(text_length),str(in_gt),str(web_id)])
                        #print([word,str(fr),str(score_h1),str(score_h2),str(score_h3),str(score_h4),str(score_h5),str(score_h6),str(score_anchor),str(score_title),str(score_url_host),str(score_url_query),str(text_length),str(in_gt),str(web_id)])
                except:
                    print (web_id)
                    pass

    ##############################################################################################################
    root = '/home/tko/himat/web-docs/keywordextraction/dataset2/theguardian/'
    Get_guardian_url_words_list(root)

    #####################################################################################################
    def Rename_files(path):
        import os
        files = os.listdir(path)
        i = 0
        for file in files:
            os.rename(os.path.join(path, file), os.path.join(path, str(i)))
            i = i+1
    #path = r'/home/tko/himat/web-docs/keywordextraction/sets/indianexpress/'
    #Rename_files(root)

    ###########################################################################
    import os
    def File_exists(root, ranges):
        # print the ids of webpages whose tags.txt is missing or empty
        for x in range(ranges):
            my_path = root + str(x) + '/tags.txt'
            if os.path.exists(my_path) and os.path.getsize(my_path) > 0:
                p = 0
            else:
                print (x)

    ####################################################################################
    # Creating the ground truth (GT) txt file
    def GT_txt():
        txt_file = open('gt.txt', 'w', encoding='utf-8')
        for web_id in range(402):
            files = root + str(web_id) + '/tags.txt'
            urls, GT = D.Read_Txt(files)
            GT = ' '.join(GT)
            txt_file.write(str(web_id) + ' ' + str(GT) + '\n')
            print (GT)
            print ('---------------------', web_id)
        txt_file.close()

    ACI-Rank

    In this ACI-Rank section we will discuss four main parts:
    (1) Introduction (2) Methodology (3) Python Implementation (4) Output Section

    (1) Introduction

    To be updated soon.

    (2) Methodology

    Under preparation.


    Fig. 6. The ACI-Rank workflow.

    (3) Python Implementation

    1. Extract Text
    2. Preprocess Text
    3. POS Tag Separation
    4. WordNet Semantic Similarity (a hedged sketch of steps 4-5 follows this list)
    5. Cluster Words
    6. Keyword Ranking and Selection
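    Steps 4 and 5 are not spelled out in the implementation below, so here is a hedged sketch of how WordNet-based similarity and a simple grouping of candidate words could look. The similarity measure (path similarity over the first few synsets), the threshold, and the greedy grouping rule are illustrative assumptions only, not the exact ACI-Rank procedure.

        # Hedged sketch of steps 4-5: WordNet semantic similarity + simple clustering.
        # The threshold and the grouping rule are illustrative assumptions.
        from nltk.corpus import wordnet as wn

        def word_similarity(w1, w2):
            # highest path similarity over the first few synsets of each word, 0 if none
            s1, s2 = wn.synsets(w1), wn.synsets(w2)
            if not s1 or not s2:
                return 0.0
            return max((a.path_similarity(b) or 0.0) for a in s1[:3] for b in s2[:3])

        def cluster_words(words, threshold=0.3):
            # greedily add a word to the first cluster whose head word is similar enough
            clusters = []
            for w in words:
                for c in clusters:
                    if word_similarity(w, c[0]) >= threshold:
                        c.append(w)
                        break
                else:
                    clusters.append([w])
            return clusters

        print(cluster_words(["car", "automobile", "vehicle", "news", "football"]))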

    # Imports
                import urllib
                import nltk
                import sys
                import re 
                
                import lxml
                import math
                import string
                import textwrap
                import requests
                
                from nltk.corpus import stopwords
                from bs4 import BeautifulSoup
                from nltk import word_tokenize
                from nltk.stem import WordNetLemmatizer
                from collections import defaultdict,Counter
                from nltk.corpus import stopwords
                from collections import defaultdict 
                from bs4.element import Comment
                Common_Nouns ="january debt est dec big than who use jun jan feb mar apr may jul agust dec oct ".split(" ")
                URL_CommnWords =['','https','www','com','-','php','pk','fi','http:','http']
                URL_CommonQueryWords = ['','https','www','com','-','php','pk','fi','https:','http','http:']
                UselessTagsText =['html','style', 'script', 'head',  '[document]','img']
                from nltk import wordpunct_tokenize
                from urllib.parse import urlparse 
                
                
                import warnings
                import numpy as np
                from nltk.corpus import wordnet as wn   # used by Get_Nosynsets below
                warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
                warnings.filterwarnings("ignore")
                from nltk.stem.snowball import SnowballStemmer
                stemmer = SnowballStemmer("finnish")
                # Stopwords imports
                from nltk.corpus import stopwords
                STP_SET_ENG_NLTK = set(stopwords.words("english"))
                F_stopwords = set(stopwords.words("finnish"))
                english_stop_words =[x for x in STP_SET_ENG_NLTK]
                finnish_stop_words =[x for x in F_stopwords]
                Eng_Finn_Combine_Stpwrds = english_stop_words + finnish_stop_words
                
                # New imports
                def Scrapper1(element):
                    if element.parent.name in UselessTagsText:
                        return False
                    if isinstance(element, Comment):
                        return False
                    return True
                
                def Scrapper2(body):             
                    soup = BeautifulSoup(body, 'lxml')      
                    texts = soup.findAll(text=True)   
                    name =soup.findAll(name=True) 
                    visible_texts = filter(Scrapper1,texts)        
                    return u" ".join(t.strip() for t in visible_texts)
                
                def Scrapper3(text):                  
                    lines = (line.strip() for line in text.splitlines())    
                    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
                    return u'\n'.join(chunk for chunk in chunks if chunk)
                
                
                def Scrapper_title_4(URL):
                  req = urllib.request.Request(URL, headers={'User-Agent' : "Magic Browser"})
                  con = urllib.request.urlopen(req)
                  html= con.read()
                  title=[]
                  
                  soup = BeautifulSoup(html, 'lxml') 
                  title.append(soup.title.string)
                  return(title)
                
                def Web_Funtion(URL):
                  req = urllib.request.Request(URL, headers={'User-Agent' : "Magic Browser"})
                  con = urllib.request.urlopen(req)
                  html= con.read()  
                  Raw_HTML_Soup = BeautifulSoup(html, 'lxml') 
                 
                  raw =Scrapper2(html)
                  Raw_text = Scrapper3(raw) 
                  return(Raw_text,Raw_HTML_Soup) 
                
                ##################################################################
                def Clean_NoSyn(No_syn):
                    Words =[]
                    for x in No_syn:
                        x = x.strip('.').strip(':').strip('?').strip('/').strip("'").strip ("©").strip("»").strip("/").strip(" ").strip(",")
                        for n in x.split('.'):
                            for k in n.split('-'):
                                for m in k.split('/'):
                                    if m not in ["©","»","/"," "] and len(m)>1 and m.isalpha():
                                        Words.append(m)
                    return (Words)
                
                def explode(h_d):
                    alt_words=[]
                    if len(h_d)>0:
                        for k,i in h_d.items():      
                   
                            for x in i:
                                word=[n for n in x.split(',')]
                                for x in word:
                                    words=[i for i in x.split() ]
                                    for x in words:
                                        alt_words.append(x)
                        return(alt_words)
                    else:
                        return(alt_words)
                    
                def get_text(soup,h):
                    text=[]
                    zero=[]
                    for w in soup.find_all(h):
                        h_text = w.text.strip()
                        h_text =h_text.replace(':','') #change made here
                        h_text =h_text.replace(',','')
                        
                        #h_text =(h_text.lower())
                        #change made here 
                        for x in h_text.split('-'):
                            text.append(x)
                    if len(text)!=0:
                        return(text)
                    else:
                        return(zero)
                    
                def Extract_headerAnchorTitle(soup):
                    h1_d ={}
                    h2_d ={}
                    h3_d ={}
                    h4_d ={}
                    h5_d ={}
                    h6_d ={}
                    title_d={}
                    
                    h1_d['h1']= get_text(soup,'h1')
                    h2_d['h2']= get_text(soup,'h2')
                    h3_d['h3']=get_text(soup,'h3') 
                      
                    title_d['title']= get_text(soup,'title')  #CALLing      
                    H1=explode(h1_d)
                    H2=explode(h2_d)
                    H3=explode(h3_d)
                    
                
                    
                    T= explode(title_d)
                    return(H1,H2,H3,T)
                
                
                def Bold_italic_text(HTML):    
                    bold_italic_text2 =[]
                    bold = [w.text for w in HTML.find_all('bold')]
                    italic = [w.text for w in HTML.find_all('i')]
                    bold2 = [w.text for w in HTML.find_all('b')]
                    strong = [w.text for w in HTML.find_all('strong')]
                    bold_italic_text = bold + italic + bold2 + strong
                    for x in bold_italic_text:
                        x = x.split()
                        for i in x:
                            bold_italic_text2.append(i)         
                                   
                    return (bold_italic_text2)
                
                def Bold_italic_Score(feature,score,Upper,Capital,Stpwords_list):
                    feature_dic ={}
                    if len(feature)> 0: 
                        feature =[x for x in feature if x not in Stpwords_list ]
                        
                        if Upper is True:
                            list_bold = [x for x in feature if len (x) >1 and x[0].isupper() and not x[1].isupper()]
                        if Capital is True:
                            list_bold = [x for x in feature if len (x)>1 and x.isupper() ]
                            
                            
                        
                        len_f = len(feature)
                        Counters = Counter (list_bold)
                        for x,i in Counters.most_common():
                            v = (i /len_f) *score
                            feature_dic[x.lower()]=v
                            
                            
                    return (feature_dic)  
                
                def Get_Nosynsets(Text):
                    no_syn_words =[]
                    for i in Text.split():
                        
                        a1 =wn.synsets(i)
                        
                        if (len(a1))< 1:
                            
                            if i not in STP_SET_ENG_NLTK and len(i) > 1:
                                
                                no_syn_words.append(i.lower())
                    
                   
                   
                    Words = Clean_NoSyn(no_syn_words)
                   
                    return (Words)
                
                
                def Score_feature(feature,score,stopwords_list):
                    feature_dic ={}
                    U_first = Bold_italic_Score(feature,2,True,False,stopwords_list)
                    C_all= Bold_italic_Score(feature,3,False,True,stopwords_list)    
                    
                    if len(feature)> 0:   
                        Score =0
                        feature = [x for x in feature if x not in stopwords_list]
                        len_f = int(len(feature))
                        
                        Counter_feature = Counter(feature)
                        
                        for x, i in Counter_feature.most_common():
                            
                            v = (i /len_f) *score
                            U = U_first.get(x)
                            C = C_all.get(x)
                            Score = v
                            #if C is not None:
                                #Score += C
                            #if U is not None:
                                #Score += U
                                
                            
                            feature_dic[x.lower()]=Score
                    else:
                        return (feature_dic)
                    return (feature_dic)
                
                def Check_null(word, feature_dict):
                     m1 = feature_dict.get(word)
                     if m1 is None:
                         m1 = 0
                     return (m1)
                def Check_value(word,h1,h2,h3,host,Query,Title):
                    m1 = Check_null(word, h1)
                    m2 = Check_null(word, h2)
                    m3 = Check_null(word, h3)    
                    
                    m4 = Check_null(word, host)
                    m5 = Check_null(word, Query)
                    m6 = Check_null(word, Title)  
                    return (m1,m2,m3,m4,m5,m6)
                def Get_Nouns_without_Stopwords(Text):
                    Nouns =[]
                    for line in Text.split():
                       
                        
                        tokens = nltk.word_tokenize(line)
                        
                        tagged = nltk.pos_tag(tokens)    
                        for x,y in tagged:
                          if y in ['NNP','NNPS','NNS','NN']:
                              #Nouns.append(x)
                              if x not in STP_SET_ENG_NLTK:
                                  Nouns.append(x)                
                    return (Nouns)
                def Function_ParseURL(URL):
                    URL =str(URL)
                    host=[]
                    obj=urlparse(URL)    
                    name =(obj.hostname)
                    if len(name)>0:
                        for x in name.split('.'):
                            if x.lower() not in URL_CommonQueryWords:
                                host.append(x)
                        else:
                            host.append(name)
                    path=[]
                    host_part_URL =[]
                          
                    for url_parts in URL.split('/'):
                        for url_part in url_parts.split('.'):            
                            if (len(url_part)>0):
                                for url_words in url_part.split('-'):
                                    if url_words.lower() not in URL_CommnWords and url_words.lower() not in host: 
                                        path.append(url_words.lower())
                            else:
                                path.append(url_parts)                
                    return(host,path)
                
                
                def Frequent_Words(Text):
                     
                    #1 Remove stopwords and pre-process
                    Cand_Words = [x for x in Text.split() if x not in Eng_Finn_Combine_Stpwrds]
                    Cand_Words= Clean_NoSyn( Cand_Words)
                    Cand_Words = [x.strip().lower() for x in Cand_Words if x not in ["©","»","/"," "] and len(x)>1 and x.isalpha()]
                       
                    #4 Counting frequencies of candidate words
                    Cand_100_Words_list =[]
                    lengt_text = len(Cand_100_Words_list) 
                    Count_Cand_Words = Counter(Cand_Words)
                    Top_10_keywords=[]
                    for word,count in Count_Cand_Words.most_common(10):
                        Top_10_keywords.append(word)
                    return (Top_10_keywords)
                
                
                def Generate_100_Candidate_Keywords(URL,Text, HTML,STEMS):
                    
                    #1 Remove stopwords and pre-process
                    Cand_Words = [x for x in Text.split() if x not in Eng_Finn_Combine_Stpwrds]
                    Cand_Words= Clean_NoSyn( Cand_Words)
                    Cand_Words = [x.strip().lower() for x in Cand_Words if x not in ["©","»","/"," "] and len(x)>1 and x.isalpha()]
                    
                
                    
                    #4 Counting frequencies of candidate words
                    Cand_100_Words_list =[]
                    lengt_text = len(Cand_100_Words_list) 
                    Count_Cand_Words = Counter(Cand_Words)
                    for word,count in Count_Cand_Words.most_common(100):        
                        
                        if STEMS:
                            Cand_100_Words_list.append (stemmer.stem(word))
                        else:
                            Cand_100_Words_list.append(word)
                        
                    return (Cand_100_Words_list)
                
                def Extract_keywords_Base(URL,Text, HTML, N):
                    #Base method only make all false
                    STEMS =False        
                    Cand_100_Words_list = Generate_100_Candidate_Keywords(URL,Text, HTML,STEMS)    
                        #2 Features URL list
                    url_host, url_query = Function_ParseURL(URL)
                    url_query = [x.strip().lower() for x in url_query if x not in ["©","»","/"," "] and len(x)>1 and x.isalpha()]
                    
                    #3 Feature Headers and title list
                    H1, H2, H3,title = Extract_headerAnchorTitle(HTML)
                    bold_italic = Bold_italic_text(HTML)
                    bold_italic = Clean_NoSyn(bold_italic)
                        
                    #5 Score to the base Features(6) words
                       
                    h1 = Score_feature(H1,4,Eng_Finn_Combine_Stpwrds)
                    h2 = Score_feature (H2,3,Eng_Finn_Combine_Stpwrds)
                    h3 = Score_feature(H3,2,Eng_Finn_Combine_Stpwrds)     
                    host =Score_feature(url_host,4,Eng_Finn_Combine_Stpwrds)
                    Query = Score_feature (url_query,4,Eng_Finn_Combine_Stpwrds)
                    Title = Score_feature (title,4,Eng_Finn_Combine_Stpwrds)
                    Bold = Score_feature(bold_italic,2,Eng_Finn_Combine_Stpwrds)
                    
                    #6 Go through all the 100 candidate words
                    Cand_Words_Score_dic ={}
                    
                    for cand_word in Cand_100_Words_list:
                        
                        #7 Check_values for null if not null score for cand words
                        H1_Score ,H2_Score ,H3_Score,URL_Host_Score,URL_Query_Score,Title_Score = Check_value(cand_word,h1,h2,h3,host,Query,Title)#no_syn_words,B)
                        Final_feature_Score0 = round (H1_Score + H2_Score + H3_Score)
                        Final_feature_Score1 = round (H1_Score + URL_Host_Score + Title_Score)
                        Final_feature_Score2 = round (H1_Score + URL_Host_Score + Title_Score + URL_Query_Score)
                        Final_feature_Score3 = round (URL_Host_Score + Title_Score + URL_Query_Score)
                        Final_feature_Score4 = round (H1_Score + H2_Score + H3_Score + URL_Host_Score + URL_Query_Score + Title_Score , 2)
                            
                       
                        #9 Store all cand 100 words and their features scores in dictionary        
                       
                        
                        Cand_Words_Score_dic[cand_word] =   Final_feature_Score1
                  
                    
                    
                    #10 Count the dictionary to get the top 10 words
                    Counts_Final_Features_Scores = Counter(Cand_Words_Score_dic)
                    keywords =[]
                    # 11 set number of keywords in case of mopsi 5
                    Number_of_keywords = 10
                        
                    
                    for word, fr in Counts_Final_Features_Scores.most_common(Number_of_keywords):
                        keywords.append(word)
                    
                    
                    #11 return the keywords for base method
                        
                    return (keywords)   
                    
                ###########################################################################
                def Score_in_Feature(candidate_word,feature_score_dic):
                    New_feature_dic ={}
                    for word in candidate_word:
                        Feature_Score = Check_null(word, feature_score_dic)
                        New_feature_dic[word] = Feature_Score
                    Counts_Final_Features_Scores = Counter( New_feature_dic)
                    Number_of_keywords = 10
                    keywords =[]
                    for word, fr in Counts_Final_Features_Scores.most_common(Number_of_keywords):
                        keywords.append(word)
                    
                    return(keywords)
                if __name__ == "__main__":    
                    URL ="http://bbc.com"    
                    Text, HTML =Web_Funtion(URL)
                    Keywords  = Extract_keywords_Base(URL,Text, HTML,False)
                    print (Keywords)
                

    (4) Output Section

     Keywords:
     bbc homepage dotcom window new the bbcdotcom uk eurovision ads
                  

    End of all methods.