Original Content and Text Processing
Searchers search for relevant documents using a search engine. Those are in a body of documents. The search engine returns a list of documents to respond to a search query. Search results order depends on the relevance, or rank, of each document. How does the search engine identify original content?
Learn about a document based on how other documents from that corpus refer to the particular document. Hyperlinks of documents in the corpus determine the rank of a document. Explicit references to a document do not tell us whether the document’s content is unique compared to other documents. Some documents from the corpus may contain identical or near-identical content. Search results that include a document can also contain all its copies or near copies.
Even though each copy is separate, each copy provides little or no further information to information-seeking users. SERPs with almost identical content can obscure other search results that contain unique content.
This patent tells us about ranking a document based on original content. It uses ways to identify which content is original content.
Start by Finding an Original Content Piece in a Corpus of Documents
One way this patent identifies original content is to identify the first document in a body of documents.
The first document may contain an original content piece that does not occur in any earlier document in the collection. It has a first author. That first author is associated with a first rank. The first rank of the first author uses a score of the content piece. The score is a figure of merit of the content piece.
You may come to see a second document when collecting documents. That second document could be from a second author. The score of the second piece can determine a second rank of the second author. Determine a second rank can include enhancing the first rank and then using the score of the content piece to reduce the second rank.
That content piece can include determining documents in the collection that contain the content piece. It is later than the first document. And the rank of the second content piece may depend on the rank of the first document.
The content piece rank can depend on the number of words in the content piece. The document’s rank in collecting documents can depend on the author’s rank associated with the document.
An author can be:
- A domain
- A uniform resource locator
Associating A Time with Original Content
Each document can have a time used to determine whether each document in the collection of documents is earlier or later. The time associated with a document can be when the document becomes part of the collection of documents. Time that goes with a document in the collection of documents can be:
- The creation date
- The last modified date of the file containing the document
Crawling Based on the Author
A web crawler can add documents to the collection.
The frequency when crawling documents associated with an author may depend on the author’s rank.
Depth at which to crawl documents associated with an author may also depend on the author’s rank.
A collection of documents can contain documents accessible on the World Wide Web. One or more content pieces can be from the first document. Content pieces from the first document can include condensing the content of the first document. Each of the content pieces can come from the condensed content.
Each content piece can use many words from the first document. Each content piece can contain a fixed number of words. The plurality of words may not span paragraph boundaries.
Advantages of Following the Original Content Described in the Patent
Search results are less likely to contain redundant or copied content. They will yield more interesting original content results.
The popularity of a piece of content does not rely on explicit document referencing or citation. The likelihood that content will be interesting depends on the content author’s historical propensity to produce original content.
A site’s content can be pieces of a predetermined length and translated into a common base language. The rank of a particular document (e.g., ranked based on explicit links to the particular document). It can contain content copied from one document to another document (e.g., the document that originally contained the content).
The ranks of Authors can depend on the scores of associated content pieces:
- Original content pieces associated with the author and the number of copies of original content pieces associated with the author
- Number of sources of content pieces copied by the author
- The proportion of content pieces copied from another source
The rank of the author’s documents.
The Ranking of Documents Depends on What?
The ranking of Documents will look at the:
- Rank of associated authors
- Score of associated original content pieces
- Number of sources of copied content
- Proportion of copied content
The collection of documents can encompass documents accessible on the World Wide Web. The frequency and depth of web document crawling can depend on the rank of an author.
Ranking content using content and content authors
Inventors: Douwe Osinga, and Stefan Christoph
US Patent: 10,970,353
Granted: April 6, 2021
Filed: January 17, 2019
Methods, systems, and apparatus, including computer program products for identifying original content. In one aspect, a method is described that includes identifying the first document in a collection of documents. The first document contains a content piece, which does not occur in any earlier document in the collection. The first document goes with a first author and the first author associated with a first rank. The first rank of the first author uses a score of the content piece. The score is a figure of merit of the content piece.
Original Content Uses a Respective Time
Each document can be in one or more formats. These can include HTML, PDF, text documents, word processing documents, Usenet articles, email messages, or any other document containing content or a combination of these.
Documents do not need to be in a file but can be part of a file containing other documents or other content (such as images). A file can contain many email messages, with each considered a separate document. It can be a single document stored in many related files. These documents can be a web page, file, or part accessible on the World Wide Web.
In general, a document contains content. The content can be textual (e.g., letters, words, sentences, paragraphs, or pages). It can also be non-textual content. Those would include images, sound, video, electronic games, interactive content, and combinations thereof. You can compare the content of one document to another. Although this specification focuses on textual content, the description is an example. The techniques described here apply to and can work with other types of content. These can include images, sounds, or multimedia content.
Also, Each document may have a time. The time associated with a document can be a date (e.g., day, month, year) and a time (e.g., hour, minute, second). Showing Time can vary. Associating a time with a document refers to when the document was either published or created. The start of the time can be when the document or the file in which the document has been last modified, created or accessed. It can approximate the publication time based on when the document is first encountered or discovered. That can be by a web crawler that accesses documents found on the Web).
Looking At Respective Times When Documents are in a Corpus
The time associated with a document showed when the document became part of a corpus of documents.
Each can be a document in a body of documents. The body is a collection of documents that contains many (e.g., thousands or millions) documents or references to documents. Those documents can become part of the corpus over time. The time associated with a document showed when it became part of the corpus. Each added document shows a later time than the time associated with the documents already in the corpus. The added document can become associated with an earlier time than other documents in the corpus.
A document becomes part of the corpus at and associated with a time. Another document is from a later time. When a document becomes part of the corpus, it is determined whether the content in the added document is the same as any of the content already contained in the corpus documents.
If the corpus contained only documents, the text content of documents could go with the text content of later documents. If the same text appears in both documents, then the two documents contain the same content. In particular, a part of the content can be within both documents.
The Earlier Document is likely the Earliest Occurrence Of the Content
Of the two documents, the earlier document is likely the earliest occurrence of the content. The latter document can be a partial copy of the earlier document.
The particular part of content could be original when referring to the content’s earliest occurrence (e.g., document). In contrast, the same part of the content might not be original content if referring to the content’s occurrence in any other document (e.g., document). The document may be the source of the copied content because the content piece does not occur in any earlier document. A part of the content in many documents is original in the document associated with the earliest date.
The content may have been from the earliest document in all later documents. This is true even if the content may not be a copy (e.g., when the same content is independently created on two or more occasions).
The extent of content copying by one document of another document’s content can depend on a variety of characteristics. That can include a score of each content piece. The number of documents containing that was from and the proportion of the other document’s content copied. The extent of copying can rank the copier.
Generating Content Pieces from a Document
The document might be a web page accessible on the Web. It might contain words, sentences, paragraphs, punctuation, images, or other attending data. Content in the document can be extracted, isolated, or analyzed. The process can look at the textual content and can drop or ignore non-textual content.
The search engine can use the document’s textual content while omitting other attending data (e.g., images or formatting).
Extracting content can also include standardizing the content.
Standardizing the content means that text can be:
- Punctuation removed
- Encoding converted
- Characters transliterated
- Combinations of these
The content is the standardized content of the original content from the document.
How Stop Words Might Fit Into This Original Content Process
Condensation of the content can take place. Omit “Stop” words or ignore them before, during, or after the document’s content fragmentation. Stop words are words that are often found within the entire content of the corpus. For a corpus that contains documents from several languages, stop words can include frequently used words from each language. In some implementations, the top 500 most commonly found words in each known language stop words (e.g., in English language documents, “and,” “the,” “it,” and others).
Words in a language that are stop words can vary. The number of stop words can be dynamically based on a statistical analysis of a sample of documents known to be in a given language (e.g., the most commonly used one percent of words). The condensed content is the standardized content of each document. Marking out several words in the content. Each black-out mask indicates that the covered word is an omitted or ignored stop word.
Fragment the content of a document into pieces. Each piece of content contains a part of the content from the document. Fragmentation happens when you split a document’s content into pieces containing a predefined number (e.g., four) words, for example.
The patent provides an example from the non-stop words contained in the document. It uses 13 content pieces. Each of the content pieces contains four consecutive words from the content.
How to Identify Copied Content
Consider the following passage:
Throw your soldiers into positions whence there is no escape, and they will prefer death to flight. If they will face death, there is nothing they may not achieve.
The passage can be the following condensed 9 four-word pieces, “throw soldiers positions whence,” “soldiers position whence escape,” “position whence escape prefer,” “whence escape prefer death,” “escape prefer death flight,” “prefer death flight face,” “death flight face death,” “flight face death nothing,” “face death nothing achieve.” Assume that all of the omitted words are stop words. If this passage occurs in another document, then fragmentation yields another instance of the same 9 four-word pieces that will result.
Each content piece contains words in the same order as found in the document. In some implementations, the content of each piece does not span paragraphs. Identify paragraphs in a document by delimiters found in the document. For example, in HTML, tags such as “<BR>,” “<P>,” “<H1>” can delimit paragraphs; in an ASCII text file, a paragraph can be a carriage return or a line feed.
Ignored content pieces created from the end of paragraphs that contain less than the pre-defined number of words.
Ranking a Document Based on Its Content Pieces
The process includes identifying a corpus of documents. For example, you could identify a body of Web pages found on the Web. Or a remote repository of documents (in a network drive, a database, or another content source) may be a body of documents.
Each document in the identified corpus of documents could be content pieces. Fragment those documents in the corpus. The process includes identifying the earliest occurrence of each content piece to determine which of the document’s content pieces are original.
Compare the content pieces of the document before you record them to a repository (either a list or a data structure or database) of known content pieces that appear in earlier documents.
If you compare that document’s content pieces, you can determine whether any of the document’s content pieces are original. If a content piece was not already part of the repository, the content piece did not occur in an earlier document.
The repository of content pieces can also include attending information about each piece. This could be the time of the earliest occurrence of the piece or a record of the piece’s occurrences in later documents. The repository is persistent (e.g., stored on a storage device). As you add documents to the corpus, their content pieces update the repository. When you see an original content piece within a document, its appearance becomes part of the repository. The document’s other content pieces can also update the repository (e.g., update the record of occurrences for other content pieces).
How Fragmenting Documents Or Identifying Original Content Pieces Fits In
Fragmenting documents or identifying original pieces can include translating each piece into a common base language. The base language can be any natural language for which translation is possible. The translated piece can determine whether content from the document has occurred in an earlier document in any language.
If two pieces of the content contain words in two different languages, then the two pieces may contain the same conceptual content. Consider that the English piece “boy girl dog cat” and the French piece “garcon fille chien chat” contain different words. By translating one piece to the other’s language, you can compare both pieces to one another.
Using Translations of Fragments of Content To Determine If It Is Original
A particular piece can yield one translated piece or several translated pieces in the base language, depending on whether the piece yields many potential translations. Or, rather than translating each piece, you can translate a document before fragmentation. To determine whether a particular piece is original, you can compare the piece and its potential translations to all pieces in the same base language.
For example, if you translate the piece “boy girl dog cat” into “garcon fille chien chat” and vice-versa, then only one of the pieces in the above example can be an original piece of content. The piece associated with the earlier time (e.g., that occurs in the earlier document) is, the earlier of the two occurrences.
The process can include scoring each of the document’s content pieces. Score a document’s content piece based on whether the content is original or not. For example, if an original content piece occurs in the document, you can score the piece highly (e.g., one score). In contrast, you can score it neutrally or negatively (e.g., zero or -1, respectively).
The particular numerical value used to score a piece can vary among implementations. Also score a content piece based on other documents that the content piece later occurs. For example, the scoring of the earliest occurrence of a content piece in proportion to the number of later documents the content piece occurs in.
Determine a Content Baseline by Specifying a Threshold Date
Determine a content baseline by specifying a threshold date. Any content pieces in a document associated with a time before the threshold date are part of the content baseline. Consider content pieces in the content baseline, neither original nor copied. Pieces in the content baseline capture commonly used phrases or particular quotations that are widely recited. Pieces contained in a later document (e.g., with a later time) but found in the content baseline are not considered copied.
If the piece “mother wears army boots” occurs in the content baseline, the piece occurs in a document with a time earlier than the specified baseline threshold date. Don’t score it when the same piece occurs in a later document (e.g., associated with a time later than the threshold date), then the content piece or score it neutrally. Establishing a content baseline can be particularly useful when it is difficult to establish authorship of existing content, for example, when initially adding pre-existing content to populate an empty corpus.
The process can include using the score of a document’s content pieces to rank the documents in the corpus. Sum, average, or otherwise combine the score of each content piece to produce a total score for the document.
The total score can rank the document based on its content pieces. For example, you can improve a document’s rank for each piece of content that occurs first in the document (e.g., based on the piece’s score).
In contrast, it may reduce a document’s rank if the document contains only copied content pieces.
A Document with Original Content Ranks Higher Than Other Documents
Rank a document containing original content more highly than a document that contains no original content. Improving the rank of documents that contain original content recognizes that original content is typical of more interest to information seekers.
As described above, the score of a content piece can also capture how extensively the original content occurs in other documents. Thus, ranking a document based on the score of its content pieces can recognize documents with copied original content.
The rank of a document can be [further] improved because another document has copied the content piece. Improving the rank of documents that contain copied original content recognizes that content is typically copied when it is of interest or significance to others.
The process can include ranking a document based on the number, or extent of distribution, of sources of content pieces copied in the document. The source of a copied content piece refers to the document that the piece first occurred in. A document containing copied content from many other documents can be more highly ranked than a document that has copied original content from only one document.
Documents that have copied portions of original content from many documents may be providing relevant content aggregation (e.g., making content recommendations, analysis, or review).
Ranking a Document Based on the Proportion of Content Copied From Another Source
The process can include ranking a document based on the proportion of content copied from another source document.
Worsen a document’s rank in proportion to the relative entirety of content copied from another document. The search engine may penalize the rank of a document that has copied another document’s original content more than another document that has copied only a small part of the other document’s original content.
Documents that have completely copied the content from another document or copied content from only one document are less likely to be quoting content from the content’s source. If you copied original content from a first document into one or more second documents, then the rank of the secondary documents (e.g., the rank based on explicit references) can determine a rank for the first document.
Although each reference may refer to one of the secondary documents, you may rank the first document based on each reference. All secondary documents refer (e.g., by copied content) to the first document.
Who Is a Document’s Author?
The author of a document can refer to the factual creator of the document and its content. When the factual authorship of the document cannot be reliably established, the search engine may use alternative information to identify the author of the document.
A document’s author can refer to the user’s username who created or last modified the document’s file. You can analyze the documents of an author to determine (e.g., learn) content patterns that are distinct to the particular author (e.g., word choice, unique content patterns, and stylistic or structural similarities).
The content of other documents can analyze content patterns and determine the likelihood that an author is the author of a document based on matching the author’s and the document’s content patterns.
The Document’s Author Refers to the Server Name or the Domain Specified by the Document’s URL
Where you retrieve each document in the corpus (e.g., by a crawler) from an addressable location (e.g., from a storage device, storage repository, database, network device, or the Web), the document’s author can refer to all or part of the document’s location. You can specify the location of a document with a unique name, address, or Universal Resource Locator (URL). The URL can include a server name, a domain name, and the document’s name. The URL can also include subdomains and a path referring to the document’s location on the server. For example:
And, the document’s author refers to the server name or the domain specified by the document’s URL. The author for the document at the URL in the example above can be: “domain.com.” Deriving the author from a URL can also include subdomains and paths depending on the desired specificity of authorship. A server with many documents to distinguish authorship according to whether it came from “bees.lotsofdocs.com” or “kneeslotsofdocs.com.”
Documents originating from a domain with few documents can attribute all documents to the same author. For example, documents found at “resume.homepage.com” and “about.homepage.com” can both have their authorship attributed to “homepage.com.”
In general, you can associate a particular author with many documents in the corpus. You can associate documents with author A, and attribute the content of those documents to that author.
You can attribute to author A, both pieces of original content in each document. Assume that the author associated with a document that is the earliest occurrence of a content piece has authored the content piece. Then assume that if the content piece occurs in a later document, the author has likely copied the content piece from the earlier document (and the previous document’s respective author) in which the content piece first occurred.
The document added at a third point contains a piece of content that has not occurred in any prior document, exemplified as “.tau..epsilon..rho..mu.”
You Can Rank Authors in Addition to Ranking Documents
Just as you can rank documents, so too can you rank authors. Rank an author according to associated content in documents from the author. Ranking an author based on the authors’ content can be analogous to ranking a document based on the documents’ content.
Rank an author based on the content contained in the documents associated with the author.
Base the ranking on the score of the pieces. Base the ranking on the quantity of the pieces.
Rank an author in proportion to how you attribute original content to the author. If you attribute Author A to two pieces of original content, rank that author more highly than either Author B or Author C, who are not attributed to any original content.
Ranking an Author Based on Content Pieces in Documents Associated With the Author
The process includes identifying documents, and fragmenting each document, and identifying original content pieces.
The process includes scoring each content piece about an author. Score each piece about each document the piece occurs in. Score each piece for each attribution for each unique author (e.g., each author of the documents that the piece occurs in).
Score the piece based on the association of the earliest occurrence of the content piece. The content piece in a document about author Score A with a positive score (e.g., one) because the piece’s earliest occurrence is in a document attributed to author A.
Give the same content piece in another document, for author B, a neutral or negative score (e.g., zero or -1 respectively) because the piece’s earliest occurrence is in a document not attributed to author B. Scoring the content pieces in this manner rewards authors of original content (e.g., author A) and penalizes authors of copied content (e.g., author B).
The original content piece can be further scored in proportion to the number of other documents attributed to other authors that the content piece occurs in. For example, improve the score of the content piece by two for each of the two other documents in which the content piece occurs because you attribute both documents to an author other than author A.
Or, score the piece in proportion to the number of other authors that the content piece has copied. If the same author had authored documents in the previous example, improve the content piece by one since only one author copied the content piece, albeit into two separate documents.
The process can include ranking an author based on the score of the scored content pieces. Sum, average, or combine each content piece attributed to the author to produce a total score. Base the total score on the number of documents attributed to the author. Use the total score of an author to rank the author.
To reward authors who produce content across several documents, rather than original content in a single document, multiply the total score of the author by the number of documents attributed to that author that contain original content.
Rank an Author or a Document
The process can include ranking an author based on the number, or extent of distribution, of sources of content pieces copied by the author. Improve an author’s rank if the author has copied content from many other documents rather than from a single document (e.g., if the author aggregates portions of content from many sources into a single document). If the author has copied content from many other authors rather than from the documents of a single author, you can also improve an author’s rank.
The process can include ranking an author based on the proportion of content copied from another source (e.g., another author or document).
Rank an author that copies all the original content from a document lower than an author that copies only part of the original content from the same document. Such ranking recognizes that an author who includes a small quote from an original document may be more interesting than an author who copies the entire contents of another document.
The Ranking of an author can be based on the rank of the author’s documents. Base the document’s rank on the originality of the document’s content. Rank a document based on characteristics of the document that can take time to develop or measure. For example, rank a document based on whether other documents explicitly reference the document (e.g., by hyper-linking) or positive recommendations by other processes, systems, or users. In some implementations, base the rank of a document on a combination of rankings.
Ranking an author based on the rank of documents attributed to the author recognizes the author’s content according to the evaluation metric captured by the document’s rank.
Rank an author based on one or a combination of the aforementioned methods.
Ranking A Document-Based On the Rank Of the Author Associated with the Document
This process can include ranking a document based on the rank of the author associated with the document. For example, improve a document’s rank if you can associate the document with a highly ranked author.
Ranking a document based on the author anticipates its relevance based on its history of producing original or relevant content. An author who has authored original, interesting, or otherwise highly ranked content is likely to author interesting and original content in the future.
Determine the rank of a document or an author over time as you add documents to the corpus of documents. Over time the rank of both documents and authors can change. As original content from a particular document appears in newly added documents, the rank of the particular document can improve. Likewise, an author’s rank can change as you add documents associated with the author to the corpus. An author’s rank can also change in response to the changing rank of existing documents associated with the author.
The passage of time can affect an author’s rank (e.g., assuming the corpus and its content are otherwise unchanged). An author’s rank can decay over time. Rank decay can reward (or penalize) authors who produce original content pieces over a long period of time compared to the same original content appearing over a short period of time.
A Probabilistic Ranking Model Determines the Likelihood of a Copied Document’s Rank and Age
Determine a probabilistic ranking model for a document. The probabilistic ranking model determines the likelihood of a copied document’s rank and age. If the content of a document gets copied shortly after publication (e.g., the document’s associated time), the probabilistic ranking model can predict that the document is outstanding but with a low degree of certainty. If the content of a document gets copied only once long after publication, the probabilistic ranking model can predict that the document is not very good with a higher degree of certainty.
If you rank a document based on the rank of the document’s author can also depend on the passage of time. For example, rank a new document. A new document could be a document associated with recent time compared to other documents in the corpus. It could be based on the rank of the document’s author.
Ranking a new document based on the rank of its author does not rely on measuring response to the document (e.g., links to the document or copying the document’s content). Rather, rank the new document merely based on the rank (e.g., reputation or historical record). As time progresses, the effect of the author’s rank on the document’s rank can decay.
But, the rank of the document may not change in proportion to the decay. Over time, the document’s rank may improve despite the decay. Improve the document’s rank for other reasons (e.g., copied original content from the document or other documents explicitly referenced the document). Decreasing the effect of an author’s rank on a document’s rank over time recognizes that although the author has a history of producing highly (or poorly) ranked content, rank the document on its author-independent characteristics. These can include copied content, explicit references).
Classifying an Author or a Document Based on the Rate of Copied Content
The rate at which a particular piece of copied content refers to the number of times a content piece is copied in a particular time interval. This process includes identifying a corpus of documents and fragmenting the documents in the corpus. It also includes identifying a document containing the first content piece and later documents (e.g., copying documents) in which the content piece also occurs. For any particular content piece, its rate of copying refers to the number of documents the content piece appears in over specific intervals. These may be hours, days, weeks, months.
The process includes determining the content piece’s copy history.
The copy history of a content piece refers to the content piece’s rate of copying over time. Such as a record of how often and when an original content piece occurs in later documents). A content piece’s copy history can determine how quickly disseminated the content may be. For example, if the original content is from Tuesday, it may appear in another document on Wednesday, another four on Thursday, and fifteen more on Saturday.
On Sunday, the same content may only appear in seven more documents. In this example, the rate of coping increases over time reaches a peak on Saturday, and then the rate of copying falls.
The process includes classifying documents based on the content piece’s copy history. Compare a piece of content’s copy history with predefined patterns to classify the content, the document that the content occurs in, and how associated the document’s author is. Each of the patterns generally describes a trend of copying over time. The classification of documents or authors can characterize the content contained in the document or attributed to the author.
Three Copying Patterns Plotted in Graph
Each line corresponds to a different original content piece. A line’s path in the graph describes the rate of copying over a range of time. The particular range of time used (e.g., minutes, hours, days, weeks, months, or years) can vary among implementations and can depend on:
- The anticipated copying patterns
- Content type
- Frequency of document additions to the corpus
Over the range of time, the drawing shows three data samples. The first data sample is early in time relative to the total time range over which the plotted copy patterns. The data points at the first point A represent the copying rate at that time. The next data point represents a low rate of copying compared to the data point. Such as three occurrences per hour compared to 20 or 25 occurrences per hour, respectively).
The line, for example, plots content that appears, then very quickly occurs in many other documents and then occurs only rarely in subsequent intervals (e.g., as evident from copy data at timeC).
This copy pattern can show syndicated content (e.g., a news article). Syndicated content is widely and regularly distributed for near-simultaneous publication by multiple content publishers. The document can be a syndication source. Classification of the document’s author can also be a syndication source, particularly the author’s association with other similarly classified documents.
The line plots content that is increasingly copied over time. At some later interval, copying gradually falls off (e.g., in the shape of a bell curve). This copy pattern can state that the original content is interesting but has taken time to disseminate. Production of this sort of content is from bloggers or other semi-formal content producers.
Google Looking at the Copy History of Content?
The line plots a copy history that indicates immediate dissemination of content, like news content. But unlike news content, it continues to occur in new documents over an extended period of time. Such content may be consistent with advertising content.
A content’s copy history can classify documents into distinct groups: news, advertising, and informal news/editorial. This can include blogging.
Use these copy patterns and others to classify a content piece, the earliest document in which the piece occurs, the documents that have copied the piece, and the authors of each document. For example, classify a document as “news” if it contains the earliest occurrence of a copy pattern consistent with news content.
Multiple documents with copies of such pieces can be “news aggregators.”
Documents are classified as a “blog source” if it contains the earliest occurrence of a content piece that is blog material.
Classify the document’s author as a “blogging leader,” while you can classify the author of documents with copies of the piece as a “blogging follower.”
Other classifications are possible.
Use document and author classification while ranking authors and documents. The classification of a document or author can affect their respective ranking. The search engine could penalize as an “advertisement,” a document that contains many “advertisements.”. Likewise, you can improve the rank of an author classified as a “blog leader.”
A document and author classification can enhance a searcher’s ability to search the corpus. The search engine could allow searchers to constrain their search results by specifying the desired classification.
Allow a searcher to specify that search results include only the content classified as blog material. Users can also constrain search results by specifying a threshold of popularity determined based on the content’s rate of copying.
A Schematic Diagram of a Search Engine
The search engine may rely on many components for proper operation. The document fragmentation approach finds documents in the search engine’s corpus of documents and fragments each document into pieces of content. The piece tracker records each piece, including the earliest document the piece occurred in and the document’s author. This process also tracks later documents the piece appears in and each document’s respective author. It may summarize information about each piece, including the number of times copied or the number of copied or original pieces contained in a document or attributed to an author. The tracker can also track the copy rate at which each original piece of content..
The search engine can use information that the piece tracker provides by both the author ranker and the document ranker to rank documents in the document’s body. It may do this according to the originality of each document. It may store the rank of each document in the corpus of document. Storing the rank of each author in a corpus of authors records each author and the author’s associated rank.
A Document and Author Classification Module With Information in the Piece Tracker to Classify Authors and Documents
A document retriever component populates the corpus of documents. It continuously or repeatedly retrieves documents and adds them, or references to them, to the search engine’s corpus. A web crawler can be the document tracker. The rank of authors can guide the document retriever. The author of the document is the document’s location.
The document retriever can retrieve documents more frequently and at a greater depth from the highly ranked authors. The search engine may guide the document retriever by the authors classified by the document and author classifier. New documents can be retrieved by the document retriever from authors that have particular predefined classifications. Those particular classifications can state that the author is likely to produce relevant content regularly (e.g., “news” authors).
Search engines may then process those queries received by the search engine and answered by the query processor. The search engine can also retrieve documents that are available on the network. What it knows about original content and copied content and the creators of that content can determine how it ranks content.