
High-quality personal filtering

August 25, 2008

Imagine a personal filter that connects people to web content and to other people based on interests and memories they uniquely have in common, enabling discovery of new knowledge and personal connections. High-quality matches free you from the chore of reading every item, so you don’t worry about missing something you care about in your email, on the web, or in a social network.

Background

There are 1.3 billion people on the web and over 100 million active websites. The Internet’s universe of information and people, both published and addressed to the user, grows every day. Published content includes web pages, news sources, RSS feeds, social networking profiles, blog postings, job sites, classified ads, and other user-generated content like reviews. Email (both legitimate and spam), text messages, newspapers, subscriptions, etc. are addressed directly to the user.

The growth of Internet users and competition among publishers is leading to a backlog of hundreds or thousands of unread email messages, RSS items, and web pages in users’ inboxes and readers, forcing users to settle somewhere between the extremes of reading all the items and starting fresh (as in “email bankruptcy”).

Search is not enough. If you don’t have time to read it, a personal filter reads it for you. If search is like a fishing line, useful for finding what you want right now, then a personal filter is a fishing net that captures Internet content tailored to your persistent interests, for which it is painful or inefficient to repeatedly search and get high-quality results on an hourly or daily basis.

Techniques

Personal filtering can help users recover the peace of mind that they won’t miss an important item in the email inbox, on the web, or in a social network, provided the matching algorithms deliver a sufficient signal-to-noise ratio (SNR). For example, an alert for “Devabhaktuni Srikrishna” produces relevant results today because it is specific, i.e. its SNR is high. Only about 1 in 100 alerts for “natural language processing” or “nuclear terrorism” ends up being relevant to me: signal is low and noise is high, so SNR is low, but it can be improved through knowledge of personal “context” gained from emails, similar articles, etc. Successful, relevant matches are all about context: what uniquely describes a user, and can we find it in other news or other users?

  1. The relevance of a match should be easily and instantly verifiable by the user.
  2. Statistical analysis of phrases is a proven, useful method for extracting meaningful signal from the noise and finding precise matches (see the sketch after this list).
  3. The phrases describing each person are found in what users write, read, and specify.
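
To make point 2 concrete, here is a minimal sketch in Python of how personal context can raise SNR: build a term vector from text the user has written or read, then keep only the alert hits that score well against it. The function names, the bag-of-words representation, and the 0.15 threshold are all invented for illustration.

    # Minimal sketch: re-rank generic alert hits by cosine similarity to a
    # "context" vector built from the user's own text. Names and the
    # threshold are illustrative, not from any real API.
    import math
    from collections import Counter

    def term_vector(text):
        # Bag-of-words term frequencies; a real system would use phrases too.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def filter_hits(user_corpus, alert_hits, threshold=0.15):
        # Keep only hits that resemble the user's context; a higher
        # threshold trades recall for precision (i.e., raises SNR).
        context = term_vector(" ".join(user_corpus))
        scored = sorted(((cosine(context, term_vector(h)), h)
                         for h in alert_hits), reverse=True)
        return [hit for score, hit in scored if score >= threshold]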

Framework

Let’s define the user corpus to be all content describing users, which includes the user’s own email, blogs, bookmarks, and social profiles. Interests can also be derived from items or links the user decides to upload or submits specifically by email (“like this”). As new information arrives, the most likely and statistically useful interests are automatically detected through natural language processing algorithms (a.k.a. text mining), similar to Amazon’s Statistically Improbable Phrases (SIPs). The interests are displayed on a website analogous to friend lists in social networking, the interest list is automatically ranked, and the user vets the list to keep only the most meaningful interests.
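
Amazon has not published how SIPs are computed, but a rough stand-in is easy to sketch: rank phrases by how overrepresented they are in the user corpus relative to a background corpus. The bigram representation, add-one smoothing, and cutoffs below are assumptions for illustration, not Amazon’s method.

    # SIP-like ranking: bigrams that are far more frequent in the user
    # corpus than in a background corpus are candidate "interests". This
    # ratio test is a stand-in; Amazon's actual SIP algorithm is unpublished.
    from collections import Counter

    def bigrams(text):
        words = text.lower().split()
        return zip(words, words[1:])

    def candidate_interests(user_docs, background_docs, top_n=20, min_count=3):
        user = Counter(bg for doc in user_docs for bg in bigrams(doc))
        background = Counter(bg for doc in background_docs for bg in bigrams(doc))
        user_total = sum(user.values()) or 1
        bg_total = sum(background.values()) or 1

        def score(phrase, count):
            # Relative frequency in the user corpus vs. smoothed background.
            return (count / user_total) / ((background[phrase] + 1) / bg_total)

        ranked = sorted(((score(p, c), p) for p, c in user.items()
                         if c >= min_count), reverse=True)
        return [" ".join(p) for _, p in ranked[:top_n]]

The ranked list would then be shown to the user for vetting, matching the flow described above.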


Once a user’s interest list is generated, it can be used to filter all information in the entire web corpus via open search-engine APIs (such as Yahoo BOSS and Live Search API 2.0) to create a personal search index. The filter may also be aimed at a subset of the web, which we call the user-targeted corpus, configured directly by the user: incoming email, alerts, RSS feeds, a specific news source (like the Wall Street Journal), social networking profiles, or any other information source chosen by the user. There are now open-standard APIs for email (like IMAP for Gmail and Yahoo Mail), for social networks (like the Facebook API and OpenSocial for LinkedIn, Hi5, MySpace, and other social networking sites), and for newspapers like the New York Times.
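
As a sketch of the plumbing, each vetted interest can be run as a persistent query against a search API and the deduplicated hits collected. The endpoint URL and JSON response shape below are hypothetical placeholders, not the real Yahoo BOSS or Live Search request formats.

    # Hypothetical persistent-query loop; substitute a real search client
    # for SEARCH_URL and the assumed {"results": [{"url": ..., "title": ...}]}
    # response shape.
    import requests

    SEARCH_URL = "https://search.example.com/v1/query"  # placeholder endpoint

    def fetch_matches(interests, api_key, per_query=10):
        seen, matches = set(), []
        for phrase in interests:
            resp = requests.get(
                SEARCH_URL,
                params={"q": f'"{phrase}"', "count": per_query, "key": api_key},
                timeout=10)
            resp.raise_for_status()
            for item in resp.json().get("results", []):  # assumed shape
                url = item.get("url")
                if url and url not in seen:
                    seen.add(url)
                    matches.append({"interest": phrase, "url": url,
                                    "title": item.get("title", "")})
        return matches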


The result: a manageable number of high-quality filtered matches, ranked and displayed for the user, viewable both on the website and in a periodic email (daily update).
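
The delivery side can be as simple as the sketch below, which formats the matches and sends them with the Python standard library; the SMTP host and addresses are placeholders, not a description of any particular service.

    # Assemble and send the daily update; host and addresses are placeholders.
    import smtplib
    from email.message import EmailMessage

    def send_digest(matches, to_addr, smtp_host="localhost"):
        body = "\n".join(f"[{m['interest']}] {m['title']} - {m['url']}"
                         for m in matches)
        msg = EmailMessage()
        msg["Subject"] = f"Your daily filter digest ({len(matches)} matches)"
        msg["From"] = "filter@example.com"
        msg["To"] = to_addr
        msg.set_content(body or "No new matches today.")
        with smtplib.SMTP(smtp_host) as server:
            server.send_message(msg)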


The Personal Filtering Space

References: statistical analysis of phrases

Personalized news/RSS filtering

“Recommended News” in Google News, Yahoo’s personalized news portal, My Hakia, Technorati, Newsfire, Digg, Socialmedian, Alerts.com, Dailyme, Loud3r, Feedhub, the personal portals from WSJ/NYTimes/WashPost, NewsGator/RivalMap, AideRSS, DayLife, Reddit, Feeds 2.0, Twine, Feedrinse, Particls, Filtrbox, Meehive, Pressflip, Factiva by Dow Jones, HiveFire (see blog post and demo), Sprout, my6sense (see video and review), iDiscover (iPhone app), Skygrid for mining financial information in blogs (see article), UpToDate and JournalWatch for tracking medical research by specialty, and Anchora.

Content-to-content matching

Sphere, Angstro (mining the web for mentions of your friends using their social profiles), Pique by Aggregate Knowledge, Loomia, MashLogic (see TechCrunch writeup), and Zemanta search for and display “related” content on blog posts and news articles currently browsed, analogous to how Google AdSense displays contextually relevant ads, except these services serve up blog, news, and other content. Inside Gmail, Google displays related content links below its ads under “More about…”

Person-to-person matching

eHarmony, Twine, Redux, Searchles, PeopleJar, and several Facebook apps such as FriendAnalyzer have tried to do things like this.

Persistent search

Factiva Alerts by Dow Jones, Google Alerts, Filtrbox (see Mashable interview), Yotify (see Mashable writeup), CleanOffer for real estate listings, EveryBlock for news stories about your own block through RSS or email, Trovix for searching jobs on Monster.com, Oodle for searching multiple classified sites, and Pressflip for news sources like Reuters, UPI, etc. Operating systems and email clients have also had a “virtual folder” feature for some time, either configured by the user (manual) or auto-generated whenever a user performs a search (over-active). There is also a way to do persistent search in Gmail.

Some examples of academic/research projects

Creating hierarchical user profiles using Wikipedia by HP Labs, Interest-Based Personalized Search (Utah), CMU’s WebMate, UIUC’s “An Intelligent Adaptive News Filtering System”, and “Hermes: Intelligent Multilingual News Filtering based on Language Engineering for Advanced User Profiling” (2002). Fido from Norway is “A Personalized RSS News Filtering Agent”. HP researchers compared three ways of detecting tags in “Adaptive User Profiles for Enterprise Information Access” using del.icio.us bookmark collections. “Writing a Personal Link Recommendation Engine” in Python Magazine demonstrates a web-document ranking system based on content in del.icio.us bookmarks.

Natural Language Processing & Search

Cognition, Hakia, and Powerset (acquired by Microsoft). Google personalization incorporates Personalized PageRank and uses Bayesian word clustering to find related keywords; see “Introduction To Google Ranking” and “Technologies Behind Google Ranking”.
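
Google’s production system is of course far more elaborate, but the core personalized PageRank recurrence from the literature fits in a few lines: the teleport vector is biased toward pages the user cares about instead of being uniform.

    # Textbook personalized PageRank by power iteration. The personalization
    # dict is the user-biased teleport vector; dangling-node mass is simply
    # dropped here for brevity.
    def personalized_pagerank(links, personalization, damping=0.85, iters=50):
        # links: {page: [outlinked pages]}; personalization sums to 1.0.
        pages = list(links)
        rank = {p: personalization.get(p, 0.0) for p in pages}
        for _ in range(iters):
            nxt = {p: (1 - damping) * personalization.get(p, 0.0) for p in pages}
            for p, outs in links.items():
                if not outs:
                    continue
                share = damping * rank[p] / len(outs)
                for q in outs:
                    if q in nxt:
                        nxt[q] += share
            rank = nxt
        return rank  # pages favored by the user (and their neighbors) rank higher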

Collaborative Filtering

Digg, Socialmedian, StumbleUpon, FastForward by BuzzBox, Amazon.com (for books), “Recommended News” in Google News, Twine, Filtrbox.

Personal email analysis and indexing

Xobni, Xoopit, ClearContext, Email Prioritizer by Microsoft Research (see writeups here and here).

Personal Music Filtering

Pandora

Recommender Systems

See the list of recommender systems in Wikipedia.

Related Posts

Bill Burnham writes,

…no one has yet to put together an end-to-end Persistent Search offering that enables consumer-friendly, comprehensive, real-time, automatic updates across multiple distribution channels at a viable cost.

Personalized Clustering: It’s too hard, say developers

Because let’s face it, Personalization + Clustering is the next big step in RSS. If 2005 was about Aggregation, then 2006 is all about Filtering.
Nik wrote up his thoughts today, in a post entitled Memetracking Attempts at Old Issues. While he mentions lack of link data as being an issue, it seems to me the crux of the problem is this:

“generating a personal view of the web for each and every person is computationally expensive and thus does not scale, at all.”

He goes on to say that “this is why you don’t have personalized Google results – we just don’t have the CPU cycles to care about you.”

So it’s mainly a computational and scaling problem. Damn hardware.

I think the enabling technology will be new and better algorithms. Also see the links in Filtering Services and “Why Filtering is the Next Step for Social Media”.

Web 3.0 Will Be About Reducing the Noise

…if you think it is hard enough to keep up with e-mails and instant messages, keeping up with the Web (even your little slice of it) is much worse…. I need less data, not more data.

Bringing all of this Web messaging and activity together in one place doesn’t really help. It reminds me of a comment ThisNext CEO Gordon Gould made to me earlier this week when he predicted that Web 3.0 will be about reducing the noise. (Some say it will be about the semantic Web, but those two ideas are not mutually exclusive). I hope Gould is right, because what we really need are better filters.

I need to know what is important, and I don’t have time to sift through thousands of Tweets and Friendfeed messages and blog posts and emails and IMs a day to find the five things that I really need to know. People like Mike and Robert can do that, but they are weird, and even they have their limits. So where is the startup that is going to be my information filter? I am aware of a few companies working on this problem, but I have yet to see one that has solved it in a compelling way. Can someone please do this for me? Please? I need help. We all do.

The Music Genome Project powers Pandora

The Music Genome Project, created in January 2000, is an effort founded by Will Glaser, Jon Kraft, and Tim Westergren to “capture the essence of music at the fundamental level” using over 400 attributes to describe songs and a complex mathematical algorithm to organize them. The company Savage Beast Technologies was formed to run the project.

A given song is represented by a vector containing approximately 150 genes. Each gene corresponds to a characteristic of the music, for example, gender of lead vocalist, level of distortion on the electric guitar, type of background vocals, etc. Rock and pop songs have 150 genes, rap songs have 350, and jazz songs have approximately 400. Other genres of music, such as world and classical, have 300-500 genes. The system depends on a sufficient number of genes to render useful results. Each gene is assigned a number between 1 and 5, and fractional values are allowed but are limited to half integers.[1] (The term genome is borrowed from genetics.)

Given the vector of one or more songs, a list of other similar songs is constructed using a distance function.
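
The excerpt does not say which distance function Pandora uses; assuming plain Euclidean distance over the shared 1-to-5 gene values, a nearest-neighbor lookup might look like this.

    # Illustrative only: the real metric is unspecified/proprietary. Gene
    # values are the 1-5 (half-integer) ratings described above.
    import math

    def song_distance(genes_a, genes_b):
        shared = set(genes_a) & set(genes_b)  # genres differ in gene counts
        return math.sqrt(sum((genes_a[g] - genes_b[g]) ** 2 for g in shared))

    def similar_songs(seed_genes, catalog, k=5):
        # catalog: {title: {gene_name: value}}; returns the k nearest titles.
        ranked = sorted(catalog,
                        key=lambda t: song_distance(seed_genes, catalog[t]))
        return ranked[:k]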

To create a song’s genome, it is analyzed by a musician in a process that takes 20 to 30 minutes per song. Ten percent of songs are analyzed by more than one technician to ensure conformity with the standards, i.e., reliability.

The technology is currently used by Pandora to play music for Internet users based on their preferences. (Because of licensing restrictions, Pandora is available only to users whose location is reported to be in the USA by Pandora’s geolocation software).[2]

How to Extract a Webpage’s Main Article Content

I had an idea to make a personalized news feed reader. Basically, I’d register a bunch of feeds with the application, and rate a few stories as either “good” or “bad”. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as “good”. It sounded like a plausible idea. I decided to start a pet project.

I soon learned that this idea wasn’t original, and in fact had been attempted by quite a few companies. The first to seriously implement this idea was Findory, later followed by Thoof, Tiinker, Persai, and probably others I’m not aware of. As of this writing, only Persai is still in business. Apparently, personalized news feeds aren’t terribly profitable. Why they’re not a commercial hit is a whole article in itself, so I won’t go into it now. However, before I admitted to myself that this project was doomed to failure, I decided to implement a few components to get a better feel for how the system would work. This is a review of a few interesting things I learned along the way.
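
The post never names the statistical model, but a TF-IDF plus naive Bayes classifier is one conventional minimal choice for this kind of good/bad prediction; here is a sketch using scikit-learn (the 0.7 confidence cutoff is arbitrary).

    # One plausible model, not the author's: TF-IDF features + naive Bayes.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def train_rating_model(article_texts, ratings):
        # ratings: "good"/"bad" labels the user assigned to past articles.
        model = make_pipeline(TfidfVectorizer(stop_words="english"),
                              MultinomialNB())
        model.fit(article_texts, ratings)
        return model

    def recommend(model, new_articles, min_confidence=0.7):
        # Surface only articles the model is confident the user will like.
        good_idx = list(model.classes_).index("good")
        probs = model.predict_proba(new_articles)[:, good_idx]
        return [a for a, p in zip(new_articles, probs) if p >= min_confidence]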

TechCrunch writes about PeopleJar

The site’s most powerful feature is its robust search function, which allows users to search for others using many criteria. After creating a search, users can choose to have the site persistently monitor for any matches in the future.

Robert Pasarella writes about Tools for the Equity Research Toolbox

One of the paramount abilities of a good analyst is to spot trends early and realize their potential impact on a company or industry. What analysts are usually searching for is any hint of weakness or strength in competitive advantage. Sometimes the smallest trends start in the local newspapers. Google News makes locating those topics and stories much easier.

If you pair Google News with the enhanced filtering ability of Yahoo! Pipes and your favorite feed reader, you can create some worthwhile tools that help your trend-seeking abilities.

Here is an example I’ve been working on as part of a wider range of investment ideas on Oil.

My first approach was to set up a search in Google News that highlighted anytime OIL was in the title of a story. You can do that with the ‘allintitle’ operator, and since I wanted US-based sources I added the ‘location’ operator with USA as the source. It looks like this in the Google search window:
http://news.google.com/news?hl=en&ned=us&q=allintitle:oil+location:USA&ie=UTF-8&scoring=n

To see more useful operators, check out the Google Cheat Sheet.

There are choices on the page to make this search into an RSS feed: clicking a link on the page will create a feed URL in either RSS 2.0 or Atom. You can then take that feed and do further refining in Yahoo! Pipes. I like to create a broad search from Google News and then apply a layer of filters in Pipes for key terms that I think are important. Once I have configured Pipes to my liking, it becomes a feed for my RSS reader. I also created a pipe that looks at the opinion and editorial feeds from certain newspapers. Those in the analyst community will recognize this technique as akin to using Google Alerts. Using RSS is the better mousetrap, and it doesn’t clog your mailbox.
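
The same Pipes-style key-term layer can be reproduced in a few lines of Python with the third-party feedparser package, starting from the Google News query above; the output=rss parameter and the key terms are assumptions for illustration.

    # Second-stage filter over the Google News feed built above; key terms
    # and the output=rss parameter are illustrative assumptions.
    import feedparser

    FEED_URL = ("http://news.google.com/news?hl=en&ned=us"
                "&q=allintitle:oil+location:USA&ie=UTF-8&scoring=n&output=rss")
    KEY_TERMS = ("opec", "crude", "refinery")  # example second-layer terms

    def filtered_entries(feed_url=FEED_URL, terms=KEY_TERMS):
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            text = (entry.get("title", "") + " " +
                    entry.get("summary", "")).lower()
            if any(term in text for term in terms):
                yield entry.title, entry.link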

Cull Web Content With Alerts, by Katherine Boehret, Wall Street Journal

For years, I’ve used Google Alerts as a way of keeping track of myself online. If my name is mentioned in a blog or if this column appears on the Web, such as on the site of a newspaper that syndicates it, a Google Alert sends me an email about it. Google Alerts can work for you to find a variety of things, such as telling you if a video of a favorite band popped up online or that a blogger posted something about last night’s episode of “Mad Men.”

In about a month, Google will begin delivering these alerts to users via feeds as well as emails. Google certainly isn’t alone in the alerts arena, as Yahoo, Microsoft and AOL are also players. This week I tried two small companies that recently joined the mission to help users find Web content using alerts.

I tried Alerts.com and Yotify.com, and found worthwhile features in both. While Google Alerts does a good job of finding search terms in news, blogs and videos, Alerts.com and Yotify use forms that are a cinch to fill out and let you pinpoint your searches.

Web 3.0: Not Yet by Navdeep Manaktala from New Delhi,

Major aspects of Web 2.0–vertical search, commerce, community, professional and user-generated content–have been panning out nicely, albeit slowly. But personalization, my cornerstone concept for Web 3.0, languishes.

I do believe, however, that we are going to move toward a more personalized and satisfying user experience within the next decade. After all, we went from Web 1.0 in 1995 to Web 2.0 in 2005. Only three years into Web 2.0, perhaps it is natural that we stay here and fine-tune for another five to seven years, before the real breakthrough innovations can come about and usher in Web 3.0.

Struggling to Evade the E-Mail Tsunami, by Randall Stross of the New York Times,

E-MAIL has become the bane of some people’s professional lives. Michael Arrington, the founder of TechCrunch, a blog covering new Internet companies, last month stared balefully at his inbox, with 2,433 unread e-mail messages, not counting 721 messages awaiting his attention in Facebook. Mr. Arrington might be tempted to purge his inbox and start afresh – the phrase “e-mail bankruptcy” has been with us since at least 2002. But he declares e-mail bankruptcy regularly, to no avail. New messages swiftly replace those that are deleted unread.
When Mr. Arrington wrote a post about the persistent problem of e-mail overload and the opportunity for an entrepreneur to devise a solution, almost 200 comments were posted within two days. Some start-up companies were mentioned favorably, like ClearContext (sorts Outlook inbox messages by imputed importance), Xobni (offers a full communications history within Outlook for every sender, as well as very fast searching), Boxbe (restricts incoming e-mail if the sender is not known), and RapidReader (displays e-mail messages, a single word at a time, for accelerated reading speeds that can reach up to 950 words a minute). But none of these services really eliminates the problem of e-mail overload because none helps us prepare replies. And a recurring theme in many comments was that Mr. Arrington was blind to the simplest solution: a secretary.

Sarah Perez of ReadWrite Web proposes five solutions to email overload,

  • Get it done
  • 4-hour work week
  • Email as SMS
  • Folders/rules
  • Email bankruptcy

In “Email Hell,” Ross Mayfield writes in Forbes,

E-mail overload is the leading cause of preventable productivity loss in organizations today. Basex Research recently estimated that businesses lose $650 billion annually in productivity due to unnecessary e-mail interruptions. And the average number of corporate e-mails sent and received per person per day are expected to reach over 228 by 2010.

Clint Boulton writes in “Study: Collaboration Overload Costs U.S. $588B a Year”,

“Information Overload: We Have Met the Enemy and He is Us,” authored by Basex analysts Jonathan B. Spira and David M. Goldes and released Dec. 19, claims that interruptions from phone calls, e-mails and instant messages eat up 28 percent of a knowledge worker’s work day, resulting in 28 billion hours of lost productivity a year. The $588 billion figure assumes a salary of $21 per hour for knowledge workers.
The addition of new collaboration layers forces the technologies into untenable competitive positions, with phone calls, e-mails, instant messaging and blog-reading all vying for workers’ time.
For example, a user who has started relying on instant messaging to communicate may not comb through his or her e-mail with the same diligence. Or, a workgroup may add a wiki to communicate with coworkers, adding another layer of collaboration and therefore another interruption source that takes users away from their primary tasks.
Beyond the interruptions and competitive pressure, the different modes of collaboration have created more locations through which people can store data. This makes it harder for users to find information, prompting users to “reinvent the wheel because information cannot be found,” Basex said.
Basex’s conclusion is that the more information we have, the more we generate, making it harder to manage.

In “The threat from within,” COL. PETER R. MARKSTEINER writes,

If a technological or biological weapon were devised that could render tens of thousands of Defense Department knowledge workers incapable of focusing their attention on cognitive tasks for more than 10 minutes at a time, joint military doctrine would clearly define the weapon as a threat to national security.
Indeed, according to the principles of network attack under Joint Publication 3-13, “Information Operations (IO),” anything that degrades or denies information or the way information is processed and acted upon constitutes an IO threat. That same publication cautions military leaders to be ever-vigilant in protecting against evolving technologically based threats. Yet throughout the Defense Department and the federal government, the inefficient and undisciplined use of technology by the very people technology was supposed to benefit is degrading the quality of decision-making and hobbling the cognitive dimension of the information environment.
We all receive too much e-mail. According to the Radicati Research Group, roughly 541 million knowledge workers worldwide rely on e-mail to conduct business, with corporate users sending and receiving an average of 133 messages per day – and rising. While no open-source studies address how the Defense Department’s e-mail volume compares to corporate users’, my own anecdotal experience and that of legions of colleagues suggests a striking similarity. Without fail, they report struggling every day to keep up with an e-mail inbox bloated with either poorly organized slivers of useful data points that must be sifted like needles from stacks of nonvalue-adding informational hay or messages that are completely unrelated to any mission-furthering purpose.
E-mail is a poor tool for communicating complex ideas. Text-only communication, or “lean media,” as it is referred to by researchers who study the comparatively new field of computer mediated communication, lacks the nonverbal cues, such as facial expression, body language, vocal tone and tempo, that inform richer means of communication. Moreover, aside from its qualitative shortcomings and viral-like reproductive capacity, a growing body of research suggests e-mail’s interruptive nature is perhaps the most pressing threat to decision-making in the cognitive dimension.

In “The Future of Search Won’t Be Incremental,” Adam DuVander writes,

Personalization isn’t only coming, it’s here. Sign in to your Google account and you can activate it. Prepare to be underwhelmed. But even if it were as Carrasco describes, privacy concerns would stop personalized search from being adopted until the benefits were undeniable. It would take a radical shift.

When Google came along, it provided something that had never been seen before: good search results. Unlike all the other search engines, Google’s top few slots had what we were looking for. And it provided them fast.

It was a much easier time to make big changes. Someone has to make us realize that Google’s results are as antiquated as Yahoo and Excite were in the late 90s. A change in interface might be the most likely innovation.

Sphere, which was acquired by AOL, displays articles “related” to the content of the page currently viewed by the user, and now powers over 100,000 sites, including major news outlets like the Wall Street Journal, Time, and Reuters. This is also the back-end service (see content widget) used to generate “possibly related posts” on WordPress.

Sphere’s founder explains why they created it,

“We founded Sphere with a mission to make contextually relevant connections between all forms of content (mainstream media articles, archived articles, videos, blogs, photos, ads) that enable the reader to go deep on topics of interest,” wrote Conrad.

At the time of its acquisition, Sphere reached a large number of webpages:

Sphere’s third-party network includes more than 50,000 content publishers and blogs and is live on an average of more than 2 billion article pages across the web every month.

Om Malik writes about Sphere’s original concept for blog search,

The way Sphere works is a combination of many tracks. Let’s use an example of, what else, broadband. They look for blogs that write about broadband (including those with broadband in the title of the blog) to create a short list. If I am linking to someone who is also a broadband blogger, and vice versa, Sphere puts a lot of value on that relationship. The fact is most of us broadband bloggers tend to debate with each other. Think Blog Rank instead of Google’s PageRank. The company has also taken a few steps to outsmart the spammers, and tends to push what seems like a spam blog way down the page. Not censoring, but bringing up relevant content first. They have a pronoun checker: too many I’s could mean a personal blog, with less focused information. That has an impact on how the results show up on the page.
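
Sphere’s ranking is proprietary, but the pronoun heuristic Malik describes is easy to caricature: compute the ratio of first-person pronouns and down-weight pages above some cutoff. The word list, threshold, and penalty below are invented.

    # Toy version of the "pronoun checker": many I/me/my tokens suggest a
    # personal blog, which gets a down-weighting factor. Numbers are invented.
    FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

    def personal_blog_penalty(text, threshold=0.04):
        words = text.lower().split()
        if not words:
            return 1.0
        ratio = sum(w.strip(".,!?\"'") in FIRST_PERSON
                    for w in words) / len(words)
        return 0.5 if ratio > threshold else 1.0  # multiply into relevance score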

John Battelle on how Sphere works,

It pays attention to the ecology of relationships between blogs, for example, and it gives a higher weighted value to links that have more authority. This will insure, for example, that when a Searchblog author goes off topic and rants about, say, Jet Blue, that author’s rant will probably not rank as high for “Jet Blue” as would a reputable blogger who regularly writes about travel, even if that Searchblog author has a lot of high-PageRank links into his site. Sphere also looks at metadata about a blog to inform its ranking – how often does the author post, how long are the posts, how many links on average does a post get? Sphere surfaces this information in its UI. I have to say, it was something to see that each Searchblog post gets an average of 21 links to it. Cool!

Melodee Patterson writes,

Last week I was on vacation without Internet access. Now that REALLY slowed down my infomaniac impulses! I had to settle for flipping through the stack of magazines that I had brought with me, and reading the ebooks that I had downloaded previously and stored on my hard drive. (Thank God I had thought ahead!)

While on vacation, I did have time to think about how much time I did waste on aimless browsing and unfocused research. So I created a list of changes that I’m going to implement this week. That includes:

  • removing any Google alerts that have not proved themselves useful over the past 6 months
  • deleting any RSS feeds that have not added to my knowledge or imagination
  • hitting “unsubscribe” to ezines that I don’t really read
  • using a timer to keep my Internet rovings to 15 minutes (unfortunately, it has a 7 minute snooze button)