TF-IDF

Introduction

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two metrics: Term Frequency (TF) and Inverse Document Frequency (IDF). The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the frequency of the word in the corpus.

Components of TF-IDF

1. Term Frequency (TF): Measures how frequently a term appears in a document. It’s calculated as:

     $$ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

2. Inverse Document Frequency (IDF): Measures how important a term is. While computing TF, all terms are considered equally important. IDF reduces the weight of terms that appear very frequently in the document set and increases the weight of terms that appear rarely. It’s calculated as:

     $$ \text{IDF}(t,D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t} \right) $$

3. TF-IDF: The product of TF and IDF for a term. It’s calculated as:

     $$ \text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D) $$

      Example Calculation

      Consider a small corpus of three documents:

      • Document 1 (D1): “the cat sat on the mat”
      • Document 2 (D2): “the cat sat”
      • Document 3 (D3): “the cat”

      Let’s calculate the TF-IDF for the term “cat” in each document.

      1. Calculate Term Frequency (TF)
        • D1: TF(cat, D1) = 1/6 (since “cat” appears once and there are 6 words in D1)
        • D2: TF(cat, D2) = 1/3
        • D3: TF(cat, D3) = 1/2
      2. Calculate Inverse Document Frequency (IDF): The term “cat” appears in all three documents, so:

           $$ \text{IDF}(\text{cat}, D) = \log \left( \frac{3}{3} \right) = \log(1) = 0 $$

      Since the IDF is zero, it means the term “cat” is not useful in distinguishing between documents in this corpus.
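
      This hand calculation is easy to reproduce in plain Python. Below is a minimal sketch of the unsmoothed formulas above; the tf and idf helpers are our own illustrative names, not from any library.

      import math

      # The same three-document corpus as above
      corpus = [
          "the cat sat on the mat",  # D1
          "the cat sat",             # D2
          "the cat",                 # D3
      ]
      docs = [doc.split() for doc in corpus]

      def tf(term, doc):
          # Times the term appears in the document / total terms in the document
          return doc.count(term) / len(doc)

      def idf(term, docs):
          # log(total documents / documents containing the term), no smoothing
          containing = sum(1 for d in docs if term in d)
          return math.log(len(docs) / containing)

      for i, doc in enumerate(docs, start=1):
          print(f"TF(cat, D{i}) = {tf('cat', doc):.4f}, "
                f"TF-IDF = {tf('cat', doc) * idf('cat', docs):.4f}")
      # TF values: 0.1667, 0.3333, 0.5000; every TF-IDF is 0.0 since IDF(cat) = 0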

      When using sklearn’s TfidfVectorizer, the IDF calculation includes smoothing to prevent division by zero. The formula used by sklearn is:

           $$ \text{IDF}(t, D) = \log \left( \frac{1 + \text{Total number of documents}}{1 + \text{Number of documents with term } t} \right) + 1 $$
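
      We can check this against sklearn itself: a fitted TfidfVectorizer exposes its smoothed IDF values in the idf_ attribute (note that sklearn uses the natural logarithm). A quick comparison on our corpus:

      import math
      from sklearn.feature_extraction.text import TfidfVectorizer

      corpus = ["the cat sat on the mat", "the cat sat", "the cat"]

      vectorizer = TfidfVectorizer()
      vectorizer.fit(corpus)

      n_docs = len(corpus)
      for term, col in sorted(vectorizer.vocabulary_.items()):
          df = sum(1 for doc in corpus if term in doc.split())  # document frequency
          manual = math.log((1 + n_docs) / (1 + df)) + 1        # smoothed IDF
          print(f"{term}: manual={manual:.6f}  sklearn={vectorizer.idf_[col]:.6f}")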

      Code to calculate TF-IDF

      from sklearn.feature_extraction.text import TfidfVectorizer
      import pandas as pd
      
      # Sample corpus
      corpus = [
          "the cat sat on the mat",
          "the cat sat",
          "the cat"
      ]
      
      # Initialize the vectorizer
      vectorizer = TfidfVectorizer()
      
      # Fit and transform the corpus
      X = vectorizer.fit_transform(corpus)
      
      # Convert the result to a dense matrix and print it
      df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
      print(df)
      

      sklearn’s TfidfVectorizer normalizes the vectors by default, using L2 normalization: each document’s raw TF-IDF vector is divided by its L2 norm, so the final scores form unit vectors. The L2 norm of a vector

           $$ v = \left[ v_1, v_2, \dots, v_n \right] $$

      is defined as:

           $$ \| v \| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} $$

      With the code above, we get the following output:

              cat       mat        on       sat       the
      0  0.284077  0.480984  0.480984  0.365801  0.568154
      1  0.522842  0.000000  0.000000  0.673255  0.522842
      2  0.707107  0.000000  0.000000  0.000000  0.707107

      For example, squaring and summing the entries of the first row (D1) confirms the unit norm:

           $$ 0.284077^2 + 2(0.480984^2) + 0.365801^2 + 0.568154^2 \approx 1.0000 $$
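
      The same check can be done programmatically: with the default norm='l2', every row of the matrix should have an L2 norm of 1.

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer

      corpus = ["the cat sat on the mat", "the cat sat", "the cat"]
      X = TfidfVectorizer().fit_transform(corpus)

      # Each document row is a unit vector under the default L2 normalization
      print(np.linalg.norm(X.toarray(), axis=1))  # [1. 1. 1.]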

      Code to visualize TF-IDF

      To visualize the TF-IDF scores, we can use the following code:

      import pandas as pd
      import matplotlib.pyplot as plt
      import seaborn as sns
      from sklearn.feature_extraction.text import TfidfVectorizer
      
      # Sample corpus
      corpus = [
          "the cat sat on the mat",
          "the cat sat",
          "the cat"
      ]
      
      # Initialize the vectorizer
      vectorizer = TfidfVectorizer()
      
      # Fit and transform the corpus
      X = vectorizer.fit_transform(corpus)
      
      # Convert the result to a dense matrix and create a DataFrame
      df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
      
      # Add document identifiers
      df['document'] = ['D1', 'D2', 'D3']
      
      # Melt the DataFrame for easier plotting
      df_melted = df.melt(id_vars='document', var_name='term', value_name='tfidf')
      
      # Plot the TF-IDF scores
      plt.figure(figsize=(12, 8))
      sns.barplot(data=df_melted, x='term', y='tfidf', hue='document')
      plt.title('TF-IDF Scores for Terms in Each Document')
      plt.xlabel('Term')
      plt.ylabel('TF-IDF Score')
      plt.legend(title='Document')
      plt.show()
      
      [Figure: grouped bar chart of TF-IDF scores for each term, by document]

      Use Cases

      TF-IDF (Term Frequency-Inverse Document Frequency) is widely used in various text mining and natural language processing (NLP) applications due to its simplicity and effectiveness in identifying important terms within documents. Here are some common use cases:

      1. Information Retrieval

      TF-IDF is used to rank documents based on their relevance to a query. Documents with terms that have high TF-IDF scores for the query terms are considered more relevant.

      • Example: Search engines use TF-IDF to index web pages and rank search results.
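
      As a rough sketch of this idea (the query string here is invented for illustration), a query can be projected into the same TF-IDF space and documents ranked by cosine similarity:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      corpus = ["the cat sat on the mat", "the cat sat", "the cat"]

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(corpus)

      # Transform the query with the *fitted* vectorizer, then rank by similarity
      query = vectorizer.transform(["cat on the mat"])
      scores = cosine_similarity(query, X).ravel()
      for rank, i in enumerate(scores.argsort()[::-1], start=1):
          print(f"{rank}. D{i + 1} (score={scores[i]:.3f})")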

      2. Text Classification

      TF-IDF is often used as a feature extraction technique for text classification tasks. It helps in transforming text into numerical vectors that can be fed into machine learning models.

      • Example: Spam detection in emails, sentiment analysis, and topic categorization.
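
      A minimal sketch of this pattern, using a tiny invented dataset (a real task needs a properly labeled corpus):

      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      # Invented toy data, for illustration only
      texts = ["win a free prize now", "meeting at 10 tomorrow",
               "free cash offer click here", "project update attached"]
      labels = ["spam", "ham", "spam", "ham"]

      # TF-IDF turns the text into numerical features for the classifier
      model = make_pipeline(TfidfVectorizer(), LogisticRegression())
      model.fit(texts, labels)
      print(model.predict(["claim your free prize"]))  # expected: ['spam']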

      3. Document Similarity

      TF-IDF is used to measure the similarity between documents by comparing their TF-IDF vectors.

      • Example: Finding duplicate documents, clustering similar documents, and recommending similar articles.
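
      For example, the pairwise cosine similarities between the TF-IDF vectors of our three documents:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      corpus = ["the cat sat on the mat", "the cat sat", "the cat"]
      X = TfidfVectorizer().fit_transform(corpus)

      # 3x3 matrix of pairwise similarities; the diagonal is 1.0 (self-similarity)
      print(cosine_similarity(X).round(3))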

      4. Keyword Extraction

      TF-IDF can be used to extract keywords or key phrases from a document, as terms with high TF-IDF scores are considered important.

      • Example: Summarizing articles, generating tags for documents, and content analysis.
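
      A simple sketch: take the k highest-scoring terms in each document’s TF-IDF row as its keywords (k = 2 here is an arbitrary choice):

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer

      corpus = ["the cat sat on the mat", "the cat sat", "the cat"]

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(corpus).toarray()
      terms = vectorizer.get_feature_names_out()

      k = 2  # number of keywords per document
      for i, row in enumerate(X, start=1):
          top = np.argsort(row)[::-1][:k]  # indices of the k largest scores
          print(f"D{i}: {[terms[j] for j in top]}")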

      5. Content Recommendation

      TF-IDF vectors can be used to recommend content based on user preferences and document similarities.

      • Example: Recommending news articles, research papers, or products based on textual descriptions.
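
      A rough sketch (the item catalog below is invented): represent each item by the TF-IDF vector of its description, then recommend the item closest to what the user last engaged with:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Invented item descriptions, for illustration only
      items = ["wireless noise cancelling headphones",
               "bluetooth over-ear headphones with mic",
               "stainless steel kitchen knife set"]

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(items)

      # Recommend the catalog item most similar to the user's last viewed item
      viewed = vectorizer.transform(["noise cancelling bluetooth headphones"])
      scores = cosine_similarity(viewed, X).ravel()
      print(items[scores.argmax()])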

      6. Document Clustering

      TF-IDF is used to convert documents into numerical vectors for clustering algorithms like K-means, enabling the grouping of similar documents.

      • Example: Grouping customer reviews, organizing a large corpus of text into coherent clusters.
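
      A minimal sketch with K-means (the two-topic corpus and n_clusters=2 are illustrative choices):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.cluster import KMeans

      # Two invented topics: cats and financial news
      docs = ["the cat sat on the mat", "the cat sat", "the cat",
              "stock markets fell sharply today", "stock markets rallied after earnings"]

      X = TfidfVectorizer().fit_transform(docs)

      # KMeans accepts the sparse TF-IDF matrix directly
      km = KMeans(n_clusters=2, n_init=10, random_state=0)
      print(km.fit_predict(X))  # e.g. [0 0 0 1 1]; label numbering is arbitrary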

      In the coming blogs, we will read about BM-25 (Best Matching 25) and then about rerankers, both important components for improving RAG performance. For more background, see Anthropic’s post on contextual retrieval:

      https://www.anthropic.com/news/contextual-retrieval
