Want to check details on what your newsroom wrote about a topic a year ago? Need to be up to date about an evolving but quite unspecified topic on the newswire? Wish to keep web visitors on your site by offering more of the same sort that brought them there in the first place? Need to get a fast overview of how different agencies have covered a hot topic? Would like to get all this without having to invest a lot of work in maintaining taxonomies, ontologies, and named entity lists?
Lingsoft FindSimilar combines three well-established natural language processing technologies – latent semantic analysis, the finite-state two-level morphology model for word form analysis, and constraint grammar for disambiguation – into an innovative, cost-effective and well-integrated solution for finding relevant similarities in textual content.
First you create a similarity matrix. Before you can start using FindSimilar for your actual tasks, you need to build a similarity matrix. FindSimilar has an easy-to-use Matrix Builder, where you define a story collection and the parameters for the matrix. If you want to cover a newspaper archive containing possibly millions of stories, you may need thousands of stories for a good matrix.
However, in sharp contrast to many labour-intensive methods, FindSimilar creates its similarity matrix practically without manual intervention or fine-tuning, while still delivering equal or better relevance in similarity-type searches.
- Lingsoft's advisor and the customer's language manager specify the matrix parameters and hit Build.
- FindSimilar retrieves from the story archive the collection of stories (typically hundreds or thousands) on which the similarity matrix will be based.
- FindSimilar builds a matrix of story vectors, and stores it in a relational database. FindSimilar uses Lingsoft's word form analyzer and constraint grammar parser for the corresponding language (in this case Finnish) to retrieve disambiguated base forms for the words in the stories. This procedure is optional, but highly recommended, as it compresses the story vectors substantially, increasing the precision and performance of the matrix.
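The effect of base forms on vector size can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `BASE_FORMS` lookup table is a hypothetical stand-in for a real word form analyzer and disambiguator, the sample text is English rather than Finnish, and the sketch stops at term counts without performing the latent semantic analysis step.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyzer: maps inflected word
# forms to base forms. A real analyzer would also disambiguate by context;
# this lookup table does not.
BASE_FORMS = {
    "stories": "story",
    "wrote": "write",
    "written": "write",
    "writes": "write",
    "editors": "editor",
}

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace and strip simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def term_counts(text, use_base_forms=True):
    """Return a sparse term-frequency vector (term -> count) for one story."""
    tokens = tokenize(text)
    if use_base_forms:
        tokens = [BASE_FORMS.get(t, t) for t in tokens]
    return Counter(tokens)

story = "The editor wrote two stories. Editors write; a story is written."
raw = term_counts(story, use_base_forms=False)
reduced = term_counts(story, use_base_forms=True)

print(len(raw), "distinct word forms vs", len(reduced), "distinct base forms")
```

With base forms, several inflected forms collapse into one dimension, so each story vector has fewer and denser entries.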
Kick-start with FindSimilar Clinic. Some key parameters need to be adjusted experimentally, based on how the similarity vector matrix behaves with the specific characteristics of the targeted content. We therefore recommend that these parameters be jointly designed and iteratively tested by Lingsoft's advisor together with the customer in a Lingsoft FindSimilar Clinic.
The initial matrix-building phase is now completed. Once the parameters are tested and set in place, re-building the matrix and re-vectorizing the story archive is a straightforward task that can be automated and scheduled to run, for instance, nightly or over the weekend. Regular re-builds are needed to make the matrix aware of new topics, terms and expressions.
Then you vectorize your content. One of the great features of FindSimilar is that once you have a similarity matrix in place, you can let FindSimilar vectorize a massive amount of content and cover similarities throughout your whole story archive.
- With the similarity matrix in place, you can schedule FindSimilar to continue by indexing the story archive, which may contain millions of stories.
- Again, FindSimilar uses Lingsoft's analyzers to retrieve disambiguated base forms for the words contained in each story, in order to compress the similarity vector.
- FindSimilar runs each story against the similarity matrix, and stores similarity vectors and story archive pointers in the same database as the similarity matrix.
This completes the indexing phase. Now the system is ready for use.
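The indexing steps above can be sketched roughly as follows. This is a simplified illustration, not FindSimilar's actual procedure: `TERM_MATRIX` is a made-up two-dimensional matrix, the function name `vectorize` is hypothetical, and the projection is a plain weighted sum of term rows (a full latent semantic analysis fold-in would also scale by the singular values).

```python
import math

# Hypothetical similarity matrix: each base form maps to a row of weights
# over k latent dimensions (k = 2 here). All values are invented.
TERM_MATRIX = {
    "election": [0.9, 0.1],
    "vote":     [0.8, 0.2],
    "goal":     [0.1, 0.9],
    "match":    [0.2, 0.8],
}

def vectorize(term_counts, term_matrix, k=2):
    """Project a term-count vector into the k-dimensional latent space."""
    vec = [0.0] * k
    for term, count in term_counts.items():
        row = term_matrix.get(term)
        if row is None:
            continue  # term unknown to the matrix: ignored until the next re-build
        for i in range(k):
            vec[i] += count * row[i]
    # Normalize to unit length so stories of different lengths are comparable.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# Story archive reduced to base-form counts (archive pointer -> counts).
archive = {
    "story-001": {"election": 2, "vote": 1},
    "story-002": {"goal": 3, "match": 1, "penalty": 1},
}

# The index maps each archive pointer to its similarity vector.
index = {story_id: vectorize(counts, TERM_MATRIX) for story_id, counts in archive.items()}
```

Terms that the matrix has never seen, such as "penalty" here, contribute nothing to the vector, which is one reason regular re-builds matter.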
Integrates with the editorial system. FindSimilar can, and should, be tightly integrated with the editorial system so that every new story is vectorized whenever it is saved or modified. While building the matrix is a rather heavy and time-consuming procedure, vectorizing an individual story is very fast.
- You are creating a new story or modifying an existing one in your story editor, and hit Save.
- Your editorial system is triggered to send the story to FindSimilar to be vectorized.
- FindSimilar calculates a compressed similarity vector for the story, and stores the vector of the new or modified story in its vector database.
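A save hook of this kind might look like the sketch below. The table layout, the function names and the JSON encoding of the vector are all assumptions chosen for illustration; only the idea of keeping vectors and archive pointers together in a relational database comes from the description above. SQLite from the Python standard library stands in for whatever database is actually used.

```python
import json
import sqlite3

def open_vector_store(path=":memory:"):
    """Open (or create) a hypothetical vector table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS story_vectors ("
        " story_id TEXT PRIMARY KEY,"  # pointer back into the story archive
        " vector   TEXT NOT NULL)"     # JSON-encoded similarity vector
    )
    return conn

def on_story_saved(conn, story_id, vector):
    """Save hook: insert the vector, or overwrite it if the story was modified."""
    conn.execute(
        "INSERT OR REPLACE INTO story_vectors (story_id, vector) VALUES (?, ?)",
        (story_id, json.dumps(vector)),
    )
    conn.commit()

def load_vector(conn, story_id):
    """Return the stored vector for a story, or None if it is not indexed."""
    row = conn.execute(
        "SELECT vector FROM story_vectors WHERE story_id = ?", (story_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = open_vector_store()
on_story_saved(conn, "story-042", [0.6, 0.8])  # first save
on_story_saved(conn, "story-042", [0.8, 0.6])  # story modified and saved again
```

Because the story identifier is the primary key, saving a modified story replaces its old vector instead of adding a duplicate.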
The following five use cases illustrate how FindSimilar can be used in the publishing industry.
Case 1: Explore your story archives. The most obvious use case for you as an editor is to find relevant reference stories to the one you're writing or reviewing.
- You are halfway through writing your column, and want to bring up the most similar stories from your story archive, so you hit Find in your story editing application.
- FindSimilar creates the similarity vector for your new story, compressed with Lingsoft's analyzers. Then it compares the vector with those already in the database, and retrieves the most similar ones to be listed in your story editing application.
- You can select the one you want to view, and your story editing application retrieves the whole story from the story archive.
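The comparison step can be sketched as a nearest-neighbour search over stored vectors. Cosine similarity is assumed here as the similarity measure because it is the usual choice with latent semantic analysis; the source does not say which measure FindSimilar uses. The vectors, story identifiers and the `most_similar` function are invented for the example.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query_vec, index, top_n=10):
    """Return the top_n (story_id, score) pairs, best match first."""
    scored = ((cosine(query_vec, vec), story_id) for story_id, vec in index.items())
    return [(story_id, score) for score, story_id in heapq.nlargest(top_n, scored)]

# Invented similarity vectors keyed by story archive pointer.
index = {
    "politics-1": [0.9, 0.1, 0.0],
    "politics-2": [0.8, 0.3, 0.1],
    "sports-1":   [0.1, 0.9, 0.2],
    "culture-1":  [0.0, 0.2, 0.9],
}

draft = [1.0, 0.2, 0.0]  # vector of the story being written
hits = most_similar(draft, index, top_n=2)
```

This sketch scans every stored vector, which is fine for a demonstration; an archive of millions of stories would need a more selective lookup. The same search with `top_n=10` would yield a "ten most similar" list.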
Case 2: Show similarity-based top lists. The same similarity matrix can serve various purposes, such as showing a "top ten list" of similar stories on your web site. You can either rely on FindSimilar to create a list automatically for each story, or you can decide that you want more control, and validate each top list while publishing or updating a story. In that case your story editing software requests a "top ten list" suggestion each time you create or modify a story, and saves your validated top list as metadata in your story archive.
- A web visitor lands on your news site, presumably caught by a well-written story. He or she wants to read more on the same topic, and hits one of the headlines in the Ten Most Similar list.
- The full story is retrieved from the story archive.
- FindSimilar simultaneously performs a rapid search for the most similar vectors compared to the new story request, and returns a new Ten Most Similar list with the selected story as reference.
Case 3: Flag stories for similarity alerts. You may want to keep yourself informed about a certain topic, say the impact of the fiscal crisis in Greece on the value of the euro. FindSimilar offers an excellent opportunity for that. You can define notification triggers for your topics of interest by selecting a story or a group of stories against which FindSimilar matches the similarity vectors of incoming stories, and it notifies you when similar-enough stories appear.
- You want to follow up a certain topic, so you hit Notify while having a matching story or story list open in your story editor. FindSimilar marks the corresponding vector or vectors with a notification parameter.
- Each arriving newswire story is sent to FindSimilar to receive a similarity vector. While storing the vector of a fresh newswire story, FindSimilar scans for notification flags that meet a certain defined similarity threshold. If a similar-enough notification flag is found, FindSimilar notifies you.
- You can now retrieve the full story from the story archive.
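A threshold-based alert of this kind could be sketched as follows. The `Watchlist` class, its method names, the threshold value and the vectors are hypothetical, and cosine similarity is again an assumed measure.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class Watchlist:
    """Holds flagged reference vectors and checks incoming stories against them."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.flags = {}  # topic label -> reference vector

    def notify_on(self, label, vector):
        """Flag a reference vector, as when the editor hits Notify."""
        self.flags[label] = vector

    def check(self, incoming_vec):
        """Return the labels whose reference vector is similar enough."""
        return sorted(
            label
            for label, ref in self.flags.items()
            if cosine(ref, incoming_vec) >= self.threshold
        )

watch = Watchlist(threshold=0.9)
watch.notify_on("greek-debt", [0.9, 0.1])
watch.notify_on("football", [0.1, 0.9])

alerts = watch.check([0.8, 0.3])  # vector of a fresh newswire story
```

Raising the threshold makes alerts rarer and more precise; lowering it catches looser matches at the cost of more noise.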
Case 4: Arrange incoming content in dynamic clusters. You are probably very familiar with the hassle of multiple iterations of the same newswire story, and of similar stories coming from multiple sources. Now you can let FindSimilar automatically group incoming content, for instance newswire stories, by mutual similarity. You can use this feature for better coverage and overview of incoming content.
- Evolving versions of newswire stories from several wires and agencies are sent to FindSimilar. Each agency may produce a slightly different version of the same news topic. FindSimilar creates a similarity vector for each story, and stores it in its database.
- You start your editing shift and want to see what has happened in certain topics lately, so you hit Topics. FindSimilar produces topic groups based on configurable similarity thresholds and presents you with a list of 'hot topics'.
- You can then retrieve the relevant story or stories from your story archive.
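One simple way to form such topic groups is to link every pair of stories whose similarity exceeds a threshold and treat each connected set as a topic. The source does not state which grouping method FindSimilar uses, so the single-link approach below, the `cluster` function and all data are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold=0.9):
    """Group story ids into topics: stories linked by similarity >= threshold."""
    parent = {sid: sid for sid in vectors}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for sid in vectors:
        groups.setdefault(find(sid), []).append(sid)
    # Largest groups first, as a rough proxy for "hot" topics.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))

# Invented vectors for successive versions of wire stories.
wire = {
    "agencyA-v1":    [0.90, 0.10],
    "agencyA-v2":    [0.88, 0.15],
    "agencyB-v1":    [0.85, 0.20],
    "agencyC-sport": [0.10, 0.95],
}

topics = cluster(wire, threshold=0.9)
```

Comparing every pair is quadratic in the number of stories, so this suits a shift's worth of incoming wire copy rather than a whole archive.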
Case 5: Use FindSimilar for classification. FindSimilar is a powerful tool for finding content easily, consistently and, above all, dynamically. One could even claim that traditional classification becomes superfluous. However, you may need to classify your content according to IPTC or some other taxonomy for content syndication, cross-media repurposing, etc. FindSimilar makes this easy and efficient. You just manually tag a representative number of stories. FindSimilar – with a little help from your staff – takes care of the rest.
- Your bright new story is ready, so you hit Save. The Story Editor application sends the story to FindSimilar for a classification suggestion.
- FindSimilar creates a similarity vector for your new story, compares it with previously created vectors, and delivers the classification tags of the most similar stories to your story editing software, thus suggesting one or more classes for your story.
- You can accept the suggestion as such, or you can edit it to better represent the story content. Hit Save and you're done.
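The suggestion step resembles a nearest-neighbour vote: look up the most similar tagged stories and propose the tags they share. The sketch below assumes cosine similarity and a simple vote count; the `suggest_tags` function, its parameters and the tag names are hypothetical and are not real IPTC codes.

```python
import heapq
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def suggest_tags(query_vec, tagged, k=3, min_votes=2):
    """Suggest tags carried by at least min_votes of the k most similar stories."""
    nearest = heapq.nlargest(k, tagged, key=lambda item: cosine(query_vec, item[0]))
    votes = Counter(tag for _, tags in nearest for tag in tags)
    return sorted(tag for tag, n in votes.items() if n >= min_votes)

# Manually tagged reference stories: (similarity vector, tags). Invented data.
tagged = [
    ([0.90, 0.10], ["politics", "economy"]),
    ([0.80, 0.20], ["politics"]),
    ([0.85, 0.30], ["economy"]),
    ([0.10, 0.90], ["sport"]),
]

suggestions = suggest_tags([0.9, 0.2], tagged)
```

The editor still has the last word: the returned tags are only a proposal to accept or edit before saving.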
Copyright ©1986-2019, Lingsoft Ltd.