These are notes on the talks. To all the presenters: I apologize if I misrepresented any information. Let me know and I will make updates.
Slides will be available at videolectures.net shortly; in the meantime, check out the #semsearch2010 Twitter stream.
I encourage everyone to look at the papers linked to the presentations. Additionally, Jeff Dalton has a post on the semsearch conference.
Update: Krisztian Balog has posted notes on his blog.
Thanh Tran Duc
Semantic search -> use semantics to enhance any/all of the search steps
Keynote: Barney Pell - Search Strategist and evangelist @ Bing
- Semantics is an 'Opportunity'
- 50% of people rely on search on a typical day
- Satisfaction of search is not going up
- 25% of queries result in quick click backs (User thought it was good but it may not have been for them)
- 42% of sessions need refinement (what specifically user wants)
- s.n. It's good to view search as a session instead of tasks
- 50% of the time searching is on long queries, "Lengthy Tasks"
- Example long query:
- 10 Unique queries, 7 partial re-queries and refinements, 57 minutes (medical)
- 11 Unique queries, 5 partial queries, 33 minutes (travel examples)
- There are opportunities for innovation
- Entity Centered Experiences
- A lot of time is spent on entity centered Experiences
- At Bing, there are entity cards
- However, Entities are ambiguous.
- Even if you know the entity, you need to uniquely find it
- Need methods for extracting and synthesizing knowledge of disparate content
- Instead of just related search, make them related entities (in Bing)
- See: search Lindsey Vonn
- Semantic Improvements to Core Search
- Semantic Retrieval & Ranking
- Better Entity tagging (Barack Obama, President Obama, the president, he)
- Derive graphs from text
- Semantic Query Understanding
- Presentation and Captions
- Match the meaning of the query not just keywords
- Respect the boundary of the content
- Example:
- ...85% of the population suffers from omega 3 fish oil...
- ...85% of the population suffers from omega 3 fish oil deficiency...
- Smart Summarization
- Could be too long, Concise, or Concise but misleading
- The highlighted answers include word variations (mocked : parodied)
- Use captions with user data, restaurant, map, reviews, etc.
- Directly Wrap/process structured websites in search (not blindly)
- Faceted search
- Preprocess foods using an open nutritional DB
- Answers and Question Answering
- Structured/Unstructured data
- Medstory.com pre-processes pages to show related information by its relevance
- A conversational assistant that lets you access different accounts. Example: say you want reservations, it accesses OpenTable
- Summary
- Deliver great Results
- Richer more organized experience
- Help user accomplish tasks easily
- Conclusions
- User needs have evolved, Data and services are proliferating, search innovations are going live daily
- Bing Demo
- Bing travel "RDU to SFO"
- Restaurants in Raleigh
- You can see reviews
- Entity Cube.com
- Developers of YAGO are now part of Microsoft
- They can take two entities and find the nearest connection of the social graph
* Research: How to search with multiple entities.
* "BMW and Mercedes" = "BMW vs Mercedes"
* What are the other relationships we can exploit between entities?
* What if we have more than one entity?
Paper: Paraphrasing Invariance Coefficient: Measuring Para-Query Invariance of Search Engines by Tomasz Imielinski and Jinyun Yan
Presenter: Jinyun Yan
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/para
- Are the two questions "What is the population of Raleigh" and "How many people live in Raleigh" the same?
- Users ask search engines nicely --> Search engines are still quite sensitive to how the question is formulated. With humans this is not the case!
- Semantic - Understand the meaning behind words
- The search engine should recognize the pair of queries and return the same result.
- Search Metric:
- Semantic invariance -- be invariant to semantically equivalent queries
- Equivalent Queries => para-queries
- How to generate para queries?
- This is a subclass of paraphrase generation, but of course para-query has its own characteristics (e.g. short, few content words)
- Extra information
- Query reformulation does not guarantee equivalent meaning.
- Created the Rephraser game
- 430 rounds, 15 min each
- Input a start query, a hidden phrase (not visible to players)
- Goal: paraphrase the start query to gain score
- Created as a multiplayer racecar game, but players play the game independently
- Use players' votes to compute the score
- You can use templates with argument slots
- Who is the governor of [X]?
- Para-queries are searches that have the same top-K URLs returned by the search engine.
- In tests, the results are low for para-query detection
- Conclusion:
- Current search engines are far from semantic because they do not recognize similar queries.
- Suggestion
- Measure search quality on paraqueries to ensure a retyped query isn't of less value
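The top-K comparison in the notes above can be made concrete by measuring how much the result lists of two candidate para-queries overlap. This is only a rough sketch: the function name `top_k_overlap`, the choice of Jaccard overlap, and the example URLs are assumptions for illustration, not the coefficient defined in the paper.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Jaccard overlap of the top-k URLs returned for two queries.

    1.0 means both queries returned the same set of top-k URLs,
    0.0 means they share none. Rank order inside the top k is ignored.
    """
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are trivially identical
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical result lists for two phrasings of the same question.
population_query = ["example.org/raleigh", "example.org/census", "example.org/nc"]
people_query = ["example.org/census", "example.org/raleigh", "example.org/travel"]

overlap = top_k_overlap(population_query, people_query, k=3)
```

A perfectly invariant engine would score 1.0 on every para-query pair; averaging the overlap over many pairs gives one possible quality number to track.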
Paper: Using BM25F for Semantic Search by José R. Pérez-Agüera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias and Victor Fresno
Presenter: José R. Pérez-Agüera
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f
- Keyword-based semantic search has become a major research area, garnering much attention in the semantic web over the last seven years.
- Is it possible to improve result quality in terms of relevance by applying just classic IR approaches to RDF semantic structure?
- "We wrote this paper for ourselves because we were interested, we thought the community would benefit"
- Main problems: Indexing RDF triples using inverted indexes, Ranking based retrieval for RDF objects
- Store RDF triples in inverted indexes or represent Subject Predicate Object in an n X m matrix
- We modify this using the SEMPLORE model, which uses the raw text of the entity and fields.
- SEMPLORE is the basis of the index model
- For a long time search engines have been dealing with flat documents (i.e. XML)
- Consequence is that all the terms have the same relevance (bag-of-words)
- Structured IR - the location of a word gives term relevance and possibly enhanced meaning (i.e. boost factors depending on the system)
- Robertson et al 2004 --> The linear combination of weights for each field of the document is not enough if a saturation function, like log(tf) or sqrt(tf), is used in the TF function.
- Just because something occurs 25 times more often doesn't mean it is 25 times more important. You need a log or sqrt curve over the term frequency
- Additionally, filters harm this saturation effect, such as in Lucene.
- INEX - XML retrieval competition
- Lucene is not used in IR conferences; it has both strong and weak points. It has a type of boolean retrieval
- Performance using Lucene with added fields is worse than using Lucene with just a large bag-of-words approach.
- Conclusions:
- Don't use the Lucene ranking function
- There is no good ranking model for semantic search, BM25F is probably the best
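To illustrate the saturation point from Robertson et al., here is a minimal sketch of a BM25F-style term weight next to a "saturate each field, then add" variant. The function names, parameter names, and all numeric values (boosts, `k1`, `b`, `idf`) are assumptions chosen for the example, not settings from the paper.

```python
def bm25f_term_weight(field_tfs, field_lengths, avg_field_lengths,
                      boosts, b_params, k1, idf):
    """Weight of one query term in one document, BM25F style.

    Field term frequencies are length-normalised, boosted and summed
    first; the saturation tf / (k1 + tf) is applied once to that sum.
    """
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        b = b_params[field]
        norm = 1.0 - b + b * field_lengths[field] / avg_field_lengths[field]
        pseudo_tf += boosts[field] * tf / norm
    return idf * pseudo_tf / (k1 + pseudo_tf)


def per_field_saturated_weight(field_tfs, boosts, k1, idf):
    """Contrast case: saturate each field separately, then combine linearly."""
    return sum(boosts[f] * idf * tf / (k1 + tf) for f, tf in field_tfs.items())


# Hypothetical document: the term occurs once in the title, 25 times in the body.
tfs = {"title": 1, "body": 25}
lengths = {"title": 5, "body": 200}
avg_lengths = {"title": 5, "body": 200}   # equal to lengths, so norm == 1
boosts = {"title": 3.0, "body": 1.0}
b_params = {"title": 0.75, "body": 0.75}

combined = bm25f_term_weight(tfs, lengths, avg_lengths, boosts, b_params,
                             k1=1.2, idf=1.0)
separate = per_field_saturated_weight(tfs, boosts, k1=1.2, idf=1.0)
```

With these numbers `combined` stays below the `idf` ceiling, while `separate` exceeds it, which is the effect the notes describe: combining already-saturated field scores linearly undoes the saturation.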
Paper: Distributed Index for Semantic Search by Peter Mika
Presenter: Peter Mika
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/indexing
- Describes the process of building indices for semantic search using MapReduce, comparing two RDF index structures
- IR relies on inverted indices
- MapReduce is a perfect model for building inverted indices
- Map creates the (term, {doc1}) pairs
- Reduce collects all the docs for the same term (term, {doc1, doc2})
- Skew is a known issue: reducers have uneven load (high-frequency terms)
- Sub-indices are merged afterwards (inexpensive)
- Implementation for building Lucene indices on Hadoop - Katta Project
- RDF has much richer structure (more expressive queries require more sophisticated indices)
- Differences in semantic search lit. as to what expressivity is required
- Pound et al., WWW2010
- Users unlikely to type SPARQL queries
- Queries on property values are required in almost all cases
- Simple, cheap solution: post-fixing
- Append the name of the verb to the end of the entity
- Good: there is less skew
- Bad: Dictionary is number of unique terms (this explodes)
- Horizontal indexing: two fields (indices), one for terms, one for properties
- Good for dictionary, occurrences is number of tokens *2
- As much skew as in normal text indexing
- Vertical indexing: One field(index) per property
- Good for Dictionary, Occurrences is number of tokens, less skew
- Bad:
- More complex than the textbooks would like you to believe
- Need to hash docids : used MG4J Minimal perfect Hash (306MB for billions of docs)
- Posting list needs to be sorted by docid
- Used the BTC 2009 data set
- Horizontal index structure is more efficient for keyword queries and field restrictions
- Indexing costs:
- Number of reducers can be chosen based on the trade-off between too many reducers and too many mappers
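The map and reduce steps and the three key layouts from these notes can be sketched in plain Python, without Hadoop. Everything here is an illustrative assumption: the function names, the field names `"text"`, `"token"` and `"property"`, and the toy entities; in particular, post-fixing is read as appending the property name to each term, and position alignment between the two horizontal fields is left out.

```python
from collections import defaultdict


def map_entity(doc_id, properties, layout):
    """Map step: emit ((field, term), doc_id) pairs for one entity."""
    for prop, text in properties.items():
        for token in text.lower().split():
            if layout == "postfix":
                # One field; the property name is appended to each term.
                yield ("text", f"{token}_{prop}"), doc_id
            elif layout == "horizontal":
                # Two fields: one for tokens, one for property names,
                # so every token produces two occurrences.
                yield ("token", token), doc_id
                yield ("property", prop), doc_id
            elif layout == "vertical":
                # One field per property.
                yield (prop, token), doc_id
            else:
                raise ValueError(f"unknown layout: {layout}")


def reduce_postings(pairs):
    """Reduce step: collect a posting list sorted by doc id for each key."""
    postings = defaultdict(set)
    for key, doc_id in pairs:
        postings[key].add(doc_id)
    return {key: sorted(doc_ids) for key, doc_ids in postings.items()}


def build_index(entities, layout):
    """Run map over every entity, then reduce all emitted pairs."""
    pairs = []
    for doc_id, properties in entities.items():
        pairs.extend(map_entity(doc_id, properties, layout))
    return reduce_postings(pairs)


# Toy entity descriptions (hypothetical data).
entities = {
    1: {"name": "alice smith", "city": "raleigh"},
    2: {"name": "alice jones"},
}

vertical = build_index(entities, "vertical")
horizontal = build_index(entities, "horizontal")
postfix = build_index(entities, "postfix")
```

In the horizontal layout the key `("property", "name")` collects every entity that has a name, which shows where skew comes from; in the post-fixed layout the dictionary grows with every term and property combination.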
Paper: Dear Search Engine: What’s your opinion about...? - Sentiment Analysis for Semantic Enrichment of Web Search Results by Gianluca Demartini and Stefan Siersdorfer
Presenter: Gianluca Demartini
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/dear
- Controversial topics are being discussed on the web
- Search engines can bias (if they wanted to) the way we see the web
- It is important to provide a good overview of top-N results for both topic and sentiment
- Contribution: approaches for computing sentiment of web pages
- Trained the classifier on a movie review data set
- So what is an ideal ranking of the sentiment for controversial search results?
- Possible results:
- Balanced Overview (+1 and -1 docs)
- Neutral Overview (0 docs)
- Realistic Overview (80% docs)
- Personalized Overview (use user profile)
- Extraction of sentiment classification of web pages
- Use a lexicon of sentiment words (SentiWordNet)
- Compared 14 different queries on 3 search engines top-1 to 50 results...
- Average sentiment is very close to zero for every search engine.
- "The first result is usually favorable for a topic"
- "Average sentiment about employment is greater than the average sentiment of marijuana"
- Quality of extraction/annotations was not studied
- Still to determine: are the top-N results a good sample?
- Several interesting applications
- A better training set could be the TREC dataset
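As a rough illustration of the lexicon-based approach and the "average sentiment of the top-N results" measurement, the sketch below scores pages with a tiny hand-made lexicon. The lexicon values, function names, and sample pages are all made up for the example; they are not taken from SentiWordNet or from the paper's classifier.

```python
import re

# Toy polarity lexicon; hypothetical values, not taken from SentiWordNet.
LEXICON = {"good": 1.0, "great": 1.0, "safe": 0.5,
           "bad": -1.0, "harmful": -1.0, "risky": -0.5}


def page_sentiment(text, lexicon=LEXICON):
    """Mean polarity of the lexicon words found in `text`, in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def average_sentiment(pages, n):
    """Average page sentiment over the top-n results of one query."""
    top = pages[:n]
    return sum(page_sentiment(p) for p in top) / len(top) if top else 0.0


# Hypothetical top-3 results for a controversial query.
results = [
    "A great and safe option for most people.",
    "Critics call the policy harmful and risky.",
    "The committee met on Tuesday.",
]

top1 = average_sentiment(results, 1)
top3 = average_sentiment(results, 3)
```

Here the first result alone looks favorable while the top three average out to zero, the same pattern the notes report: a positive first hit and a near-neutral overall average.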
Paper: Automatic Modeling of User's Real World Activities from the Web for Semantic IR by Yusuke Fukazawa and Jun Ota
Presenter:
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/modelling.pdf
- Interested in infrastructure for mobile semantic IR
- Investigated the automatic modeling of users' real-world activities from the web. (Based on the movie domain)
- They try to automatically model a user's real-world activity (based on Twitter or blogs)
- Previous work did not extract hierarchical information from the web.
- There is an ontological structure consisting of a Domain, Parent Task, and Child Tasks
- For example, what is the child task "Watch movie" vs. "Make movie"
- Apply idea of PMI-IR to solve this
- PMI-IR = hits(A and B) / (hits(A) * hits(B)) -- hits is the search engine hit count
- It is difficult to produce a measure that separates the same task from different tasks. We want mostly exact matches between same and different tasks.
- Method 3 in the paper was able to acquire 80% of hierarchical relationships
- This needs to be tested on alternative domains.
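A small sketch of how a PMI-IR style score could be used to attach a child task to a parent task. It reads the formula in the notes as the co-occurrence hit count divided by the product of the individual hit counts, which is an assumption about the intended grouping. The function names, the `"X AND Y"` query convention, and the hit counts are invented for illustration; a real system would query a search engine for the counts.

```python
def pmi_ir(hits_both, hits_a, hits_b):
    """Co-occurrence hits divided by the product of the individual hits."""
    if hits_a == 0 or hits_b == 0:
        return 0.0
    return hits_both / (hits_a * hits_b)


def best_parent(child, candidate_parents, hits):
    """Pick the candidate parent task with the highest PMI-IR to `child`.

    `hits` maps a query string to a hit count supplied by the caller.
    """
    def score(parent):
        return pmi_ir(hits.get(f"{parent} AND {child}", 0),
                      hits.get(parent, 0),
                      hits.get(child, 0))
    return max(candidate_parents, key=score)


# Made-up hit counts, for illustration only.
hits = {
    "buy ticket": 1000,
    "watch movie": 5000,
    "make movie": 2000,
    "watch movie AND buy ticket": 400,
    "make movie AND buy ticket": 20,
}

parent = best_parent("buy ticket", ["watch movie", "make movie"], hits)
```

With these counts "buy ticket" is attached to "watch movie", because the two phrases co-occur far more often than their individual frequencies would suggest.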
Paper: The Wisdom in Tweetonomies: Acquiring Latent Conceptual Structures from Social Awareness Streams by Claudia Wagner and Markus Strohmaier
Presenter: Claudia Wagner
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/tweet.pdf
- Social awareness streams (SAS) - aggregation of short natural language messages by users
- Tweetonomies - the latent structures that appear from many users
- The name comes from taxonomies and folksonomies
- Do tweetonomies exist?
- Structure of SAS
- Users, messages, content of messages
- Content: URLs, hashtags, etc.
- Information from App developers
- SAS model (see paper)
- User nodes, message nodes, resource nodes (all with qualifiers)
- Relationship between message content and users
- Use stream measures and network transformations
- Data set was from Twitter stream aggregation during certain times.
- See paper for several Structural Stream measures. They are models to understand streams
- Results
- Different types of stream aggregations influence stream structures
- hash-tag streams are more robust against external disturbances than user-list streams
- hash tags are good context indicators
- Resource-hashtag networks reveal good latent conceptual structures
- Conclusions
- The emergence of tweetonomies is dependent on the type of networks modeled
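To make the resource-hashtag network idea more tangible, the sketch below builds a weighted two-mode edge list from raw messages and then folds it into a hashtag-to-hashtag network. This is a simplified stand-in, not the SAS model from the paper: the regular expressions, function names, and sample messages are assumptions.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def resource_hashtag_network(messages):
    """Count how often each (hashtag, URL) pair appears in the same message."""
    edges = Counter()
    for text in messages:
        tags = {t.lower() for t in HASHTAG.findall(text)}
        urls = set(URL.findall(text))
        for tag in tags:
            for url in urls:
                edges[(tag, url)] += 1
    return edges


def hashtag_cooccurrence(edges):
    """Fold the two-mode network: link hashtags that share a URL."""
    tags_by_url = {}
    for tag, url in edges:
        tags_by_url.setdefault(url, set()).add(tag)
    pairs = Counter()
    for tags in tags_by_url.values():
        ordered = sorted(tags)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


# Hypothetical messages from a stream.
messages = [
    "Notes from #semsearch2010 http://example.org/notes",
    "Slides #semsearch2010 #www2010 http://example.org/slides",
    "More on #WWW2010 http://example.org/notes",
]

edges = resource_hashtag_network(messages)
related_tags = hashtag_cooccurrence(edges)
```

Hashtags that end up strongly connected after folding are candidates for related concepts, which is one simple way to look for latent structure in a stream.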
Paper: A Large-Scale System for Annotating and Querying Quotations in News Feeds by Jisheng Liang, Navdeep Dhillon and Krzysztof Koperski
Presenter: Jisheng Liang
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/newsfeeds.pdf
- Try to extract annotations in news/blog feeds to get semantic information.
- Indexing these annotations
- Build an efficient, queryable search for this
- Quotation Search
- What did
- Speaker/Subject can be specified as:
- Specific entity, Facet - category ...
- Document processing --> Clause Processing --> Normalization and Annotation
- Indexed triples are SVO triples, not currently RDF triples
- The triples have links to the entity store
- There are currently 2 million entities (people, places, organizations ...)
- Sources: Wikipedia and structured/semi-structured sources (Crunchbase, Amazon, ...)
- This entity store is being expanded continuously
- Automatic detection and import of new entities (e.g. Susan Boyle)
- Need live updates for entities (e.g. a sports player is traded)
- Entity properties
- Unique identifier
- type description, synonyms and aliases, type and facet specific properties
- Relation properties (e.g. a basketball player is linked to their team/league)
- Quote Extraction
- Detect verbs, verify verb subject is a person, check quotation marks (could be multiple sentences)
- Coreference Resolutions --> Pronouns e.g. He said, Aliases (said Gates) etc...
- Entity disambiguation
- Identify speaker, mark up boundary of quote
- Quote is indexed
- Triple: Speaker, Verb, Quote
- Each triple is a Lucene document
- Combine Lucene + semantic annotation
- keywords also stored for search
- Example facets (President, country leader)
- Accessible using REST APIs -- evri.com
- 10 million quotes
- 60K quotes added each day
- 0.5 billion SVO triples
- Query Execution Avg 159ms, median 59ms
- Future Work
- They plan on improving relevance ranking
- Identify pull quotes (the quotes from an article that the editor pulled out)
- Sentiment analysis.
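The quote extraction steps (find quotation marks, find a speech verb, find its subject, emit a Speaker/Verb/Quote triple) can be approximated with a regular expression. This is a deliberately naive sketch: the verb list, the pattern, and the names `QuoteTriple` and `extract_quotes` are assumptions, and it does none of the coreference resolution or entity disambiguation that the described system performs.

```python
import re
from typing import NamedTuple

SPEECH_VERBS = "said|says|told|added|stated"
NAME = r"[A-Z]\w+(?:\s[A-Z]\w+)*"   # one or more capitalised words

# Matches  "<quote>," said Speaker   or   "<quote>," Speaker said
QUOTE_PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*'
    r"(?:(?P<verb1>" + SPEECH_VERBS + r")\s+(?P<speaker1>" + NAME + r")"
    r"|(?P<speaker2>" + NAME + r")\s+(?P<verb2>" + SPEECH_VERBS + r"))"
)


class QuoteTriple(NamedTuple):
    speaker: str
    verb: str
    quote: str


def extract_quotes(text):
    """Return (speaker, verb, quote) triples for simple attributed quotes."""
    triples = []
    for match in QUOTE_PATTERN.finditer(text):
        speaker = match.group("speaker1") or match.group("speaker2")
        verb = match.group("verb1") or match.group("verb2")
        quote = match.group("quote").strip().rstrip(",")
        triples.append(QuoteTriple(speaker, verb, quote))
    return triples


# Hypothetical news text.
text = ('"Search is a session, not a single query," said Jane Doe. '
        '"We index every triple," Smith added.')

triples = extract_quotes(text)
```

Each resulting triple could then be stored as one index document with fields for speaker, verb and quote, in the spirit of the "each triple is a Lucene document" note.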
Paper: Semantically Enabled Exploratory Video Search by Joerg Waitelonis, Harald Sack, Zalan Kramer and Johannes Hercher
Presenter: Joerg Waitelonis
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/video.pdf
- Keyword-based search is a problem because you always need to know what you are looking for
- Doc similarity based on the syntax level
- Exploratory search
- User is not familiar with the domain
- not sure how to reach search destination
- not really sure about what she is looking for
- Users find results accidentally - Serendipitous findings
- ex: Presidents are linked to a 'battle'
- Created video search engine called yovisto
- Uses DBpedia, FOAF, Dublin Core, MPEG-7, tagging
- After search, you can get a preview of particular parts of the video
- Search terms are mapped to DBpedia entries
- There is ranking/filtering for related entities, heuristics
- rdf-graph-based
- more relevant if the graph structures of two entities are similar
- statistical/linguistic-based
- There are still many questions to be answered, not perfect for all users at all times
- Quality-based Eval
- 19 persons, 124 search tasks, 1489 search queries
- 699 without/790 with exploratory search feature
- Surveys show that users feel motivated to find the actual answer when using exploratory tools. (97% vs 82%)
Paper: Entity Search: Building Bridges between Two Worlds by Krisztian Balog, Edgar Meij and Maarten de Rijke
Presenter: Krisztian Balog
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/entity.pdf
- This is a position paper
- Entity search is important
- Being looked at by both the IR and SW communities
- Entity search tasks
- Entity ranking
- List completion
- Related entity finding
- Where are we now?
- IR
- Identifying and ranking entities in large volumes of data
- Mostly based on co-occurrences between terms and entities
- Generated models are not always meaningful for human consumption
- SW
- Structured data, naturally organized around entities
- Entity retrieval is as simple as running SPARQL queries?
- Free-text querying is more appealing to (naive) end users
- Related entity finding
- Given: input entity (E), Type (T), narrative(R)
- E = name + homepage
- T = product, organization,
- R = relationship between product
- AIM: compare IR and SW approaches on the related entity finding tasks
- TREC Entity 2009 maps the source entity to a Wikipedia page (17/20)
- Task is to capture all common entities
- IR approach
- Query, document snippet retrieval, answer candidate extraction, answer candidate [type] filtering, answer candidate ranking, Answer
- SW approaches
- Use a SPARQL query for the entity, then an exhaustive graph search.
- Most relations returned by SW methods are of type wikilinks
- Summary
- IR is useful, needs labels
- SW has the capability to produce large amounts of data, but LOD is very sparse between entities
- Enhance text-based models with semantic information from LOD
- Use IR models to discover and label links between
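The IR pipeline for related entity finding (snippet retrieval, candidate extraction, type filtering, ranking) can be sketched as a single co-occurrence count. This is a toy stand-in under heavy assumptions: the function name, the dictionary of known entities with types, and the snippets are all hypothetical, and substring matching replaces real retrieval and entity recognition.

```python
from collections import Counter


def related_entities(source, target_type, snippets, entity_types):
    """Rank entities of `target_type` by co-occurrence with `source`.

    snippets:     iterable of text snippets (stand-in for retrieval output)
    entity_types: dict mapping a known entity name to its type
    """
    counts = Counter()
    for snippet in snippets:
        if source not in snippet:                 # snippet retrieval
            continue
        for name, entity_type in entity_types.items():
            if name == source or name not in snippet:   # candidate extraction
                continue
            if entity_type != target_type:        # type filtering
                continue
            counts[name] += 1
    return [name for name, _ in counts.most_common()]   # ranking


# Hypothetical data: find products related to "Acme Corp".
entity_types = {
    "Acme Corp": "organization",
    "Widget Pro": "product",
    "Gizmo": "product",
    "Globex": "organization",
}
snippets = [
    "Acme Corp launched Widget Pro today.",
    "Widget Pro by Acme Corp beats Gizmo.",
    "Globex sued Acme Corp.",
    "A short Gizmo review.",
]

ranked = related_entities("Acme Corp", "product", snippets, entity_types)
```

The ranking says that two entities are related but not how, which matches the summary point that the IR side still needs labels for the relations it finds.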
Paper: Methodology and Campaign Design for the Evaluation of Semantic Search Tools by Stuart Wrigley, Dorothee Reinhard, Khadija Elbedweihy, Abraham Bernstein and Fabio Ciravegna
Presenter: Stuart Wrigley
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/eva.pdf
- Idea of SEALS is to develop and diffuse best practices in evaluation of semantic technologies
- Create a lasting reference infrastructure for semantic technology evaluation (SEALS platform)
- Goal: Organize two worldwide Evaluation Campaigns
- Engineering tools, storage and reasoning, ...
- You can use both their hardware and previous stuff
- Eval criteria
- Query Expressiveness
- Usability (effectiveness, efficiency, satisfaction)
- Scalability
- Quality of documentation
- Raw performance
- There is an Automated evaluation phase and a user in the loop phase
- User in the loop phase uses the Mooney Natural Language Learning Data
- EvOnto is a set of ontologies of 5 different data sizes (1K, 10K, ....10M)
- Questionnaire
- System Usability Scale - how you like the system (Likert scale)
- For automated version, they want to see system accuracy and performance
- User-in-the-loop phase - has several sets of result data
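For reference, the System Usability Scale questionnaire is conventionally scored as follows: ten items answered on a 1 to 5 Likert scale, odd items contributing (answer - 1), even items contributing (5 - answer), and the sum multiplied by 2.5. The sketch below implements that standard scoring; whether the SEALS campaign scores its questionnaire exactly this way is not stated in these notes, and the function name is an assumption.

```python
def sus_score(responses):
    """System Usability Scale score (0-100) from ten 1-5 Likert answers.

    Odd-numbered items are positively worded (contribution = answer - 1);
    even-numbered items are negatively worded (contribution = 5 - answer).
    The summed contributions are scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS expects exactly 10 responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1
        else:
            total += 5 - answer
    return total * 2.5


# A participant who answers "neutral" (3) to every item.
neutral = sus_score([3] * 10)
```

Averaging this score across participants gives one satisfaction number per tool, which can sit alongside the accuracy and performance figures from the automated phase.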