Friday, June 25, 2010

Why did Jesus have to Suffer?

I was listening to Mike Patz from First Assembly of God in Gainesville answer questions.  Although I am still watching, I had to pause and write a post because one particular answer was brilliant.

The question was one that I have wondered about several times:  "Why did Jesus have to suffer before he died?" Why couldn't he just be shot, or drowned, or otherwise die a quick death?

In Matthew 16, Luke 22 and Mark 8, Jesus tells his disciples that he must suffer many things before he is killed.  The basic reasoning behind the need for suffering, as explained by Mike, is that suffering is a requirement for forgiveness.  For example, when someone takes something that you value and destroys it, even though you forgive them, you suffer because you no longer have that item of value.

As another example, imagine a rape victim who is able to forgive her assailant.  Because she forgave, she must absorb a lot of suffering incurred through the calamity.

In the same way, in order for Jesus to truly forgive the sins of the world, the very nature of forgiveness required an element of suffering.  Because he suffered "many things" we can be sure that He was really forgiving us.  And because he died we know that our sins are really paid for, because the Bible says that the "wages of sin is death".

The more I understand the testimony of Jesus, the more I love him.

* You can watch the rest of the sermon on the church's media website.

Monday, April 26, 2010

SemSearch 2010 Notes

3rd edition of the Semantic Search Workshop
These are notes on the talks. To all the presenters, I apologize if I misrepresented any information.  Let me know and I will make updates.

Slides will be available at videolectures.net shortly; in the meantime, check out the #semsearch2010 twitter stream.
I encourage everyone to look at the papers linked to the presentations.  Additionally, Jeff Dalton has a post on the semsearch conference.

Update: Krisztian Balog has posted notes on his blog.

Thanh Tran Duc
Semantic search -> use semantics to enhance any/all of the search steps

Keynote: Barney Pell - Search Strategist and evangelist @ Bing
- Semantics is an 'Opportunity'
- 50% of people rely on search on a typical day
- Satisfaction of search is not going up
- 25% of queries result in quick click backs (User thought it was good but it may not have been for them)
- 42% of sessions need refinement (what specifically user wants)
- s.n. It's good to view search as a session instead of tasks
- 50% of the time searching is on long queries, "Lengthy Tasks"
- Example long query:
- 10 Unique queries, 7 partial re-queries and refinements, 57 minutes (medical)
- 11 Unique queries, 5 partial queries, 33 minutes (travel examples)
- There are opportunities for innovation
- Entity Centered Experiences
- A lot of time is spent on entity centered Experiences
- At Bing, there are entity cards
- However, Entities are ambiguous.
- Even if you know the entity, you need to uniquely find it
- Need methods for extracting and synthesizing knowledge of disparate content
- Instead of just related search, make them related entities (in Bing)
- See: search Lindsay Vonn
- Semantic Improvements to Core Search
- Semantic Retrieval & Ranking
- Better entity tagging (Barack Obama, President Obama, the president, he)
- Derive graphs from text
- Semantic Query Understanding
- Presentation and Captions
- Match the meaning of the query not just keywords
- Respect the boundary of the content
- Example:
- ...85% of the population suffers from omega 3 fish oil...
- ...85% of the population suffers from omega 3 fish oil deficiency...
- Smart Summarization
- Could be too long, Concise, or Concise but misleading
- The highlighted answers include word variations (mocked : parodied)
- Use captions with user data, restaurant, map, reviews, etc.
- Directly Wrap/process structured websites in search (not blindly)
- Faceted search
- Preprocess foods using an open nutritional DB
- Answers and Question Answering
- Structured/Unstructured data
- Medstory.com pre-processes pages to show related information by its relevance
- A conversational assistant that allows you to access different accounts.  Example: say you want reservations, it accesses OpenTable
- Summary
- Deliver great Results
- Richer more organized experience
- Help user accomplish tasks easily
- Conclusions
- User needs have evolved, Data and services are proliferating, search innovations are going live daily
- Bing Demo
- Bing travel "RDU to SFO"
- Restaurants in Raleigh
- You can see reviews
- Entity Cube.com
- Developers of YAGO are now part of Microsoft
- They can take two entities and find the nearest connection of the social graph
* Research: How to search with multiple entities.
* "BMW and Mercedes"  = "BMW vs Mercedes"
* What are the other relationships we can exploit between entities?
* What if we have more than one entity?



Paper: Paraphrasing Invariance Coefficient: Measuring Para-Query Invariance of Search Engines by Tomasz Imielinski and Jinyun Yan
Presenter: Jinyun Yan
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/para

- Are the two questions "What is the population of Raleigh" and "How many people live in Raleigh" the same?
- Users ask search engines nicely --> Search engines are still quite sensitive to how the question is formulated.  With humans this is not the case!
- Semantic - Understand the meaning behind words
- The search engine should recognize the pair of queries and return the same result.
- Search Metric:
- Semantic invariance -- be invariant to semantically equivalent queries
- Equivalent Queries => para-queries
- How to generate para queries?
- This is a subclass of paraphrase generation, but of course para-query has its own characteristics (e.g. short, few content words)
- Extra information
- Query Reformulation won't promise equivalent meaning.
- Created the Rephraser game
- 430 rounds 15 min each
- Input a start query, a hidden phrase (not visible to players)
- Goal: paraphrase the start query to gain score
- Created as a multiplayer racecar game, but players play the game independently
- Use players' votes to compute the score
- You can use templates with argument slots
- Who is the governor of [X]?
- Para-queries are searches that have the same top-K URLs returned by the search engine.
- In tests, the results are low for para-query detection
- Conclusion:
- Current search engines are far away from semantic because they do not recognize similar queries.
- Suggestion
- Measure search quality on paraqueries to ensure a retyped query isn't of less value
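
The "same top-K URLs" idea above can be sketched in a few lines. This is only an illustration, not the invariance coefficient defined in the paper: the function names, the default `k`, the threshold, and the example result lists are all assumptions made up for this note.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Fraction of the top-k URLs shared by two ranked result lists."""
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are trivially identical
    return len(top_a & top_b) / max(len(top_a), len(top_b))


def are_para_queries(results_a, results_b, k=10, threshold=1.0):
    """Treat two queries as para-queries when their top-k overlap is high enough."""
    return top_k_overlap(results_a, results_b, k) >= threshold


# Hypothetical result lists for "What is the population of Raleigh"
# and "How many people live in Raleigh".
population = ["example.org/raleigh", "example.org/census", "example.org/nc"]
how_many = ["example.org/census", "example.org/raleigh", "example.org/wake"]

overlap = top_k_overlap(population, how_many, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```

With a threshold of 1.0 the two lists above would not count as para-queries, since only two of the three URLs are shared; order within the top-k is deliberately ignored here.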



Paper: Using BM25F for Semantic Search by José R. Pérez-Agüera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias and Victor Fresno
Presenter: José R. Pérez-Agüera
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f

- Keyword-based semantic search has become a major research area, garnering much attention in the semantic web over the last seven years.
- Is it possible to improve the relevance of results by applying just classic IR approaches to RDF semantic structure?
- "We wrote this paper for ourselves because we were interested, we thought the community would benefit"
- Main problems: Indexing RDF triples using inverted indexes, Ranking based retrieval for RDF objects
- Store RDF triples in inverted indexes or represent Subject Predicate Object in an n X m matrix
- We modify this in the SEMPLORE model, which uses the raw text of the entity and fields.
- SEMPLORE is the basis of the index model
- For a long time search engines have been dealing with flat documents (i.e. XML)
- Consequence is that all the terms have the same relevance (bag-of-words)
- Structured IR - the location of a word gives term relevance and possibly enhanced meaning (i.e. boost factors depending on the system)
- Robertson et al 2004 --> The linear combination of weights for each field of the document is not enough if a saturation, like log(tf) or sqrt(tf) is used in the TF function.
- Just because a term occurs 25 times more often doesn't mean it is 25 times more important.  You need a log or sqrt curve over the term frequency
- Additionally, filters harm this saturation effect, such as in Lucene.
- INEX - XML retrieval competition
- Lucene is not used in IR conferences; it has both strong and weak points. It has a type of boolean retrieval
- Performance using Lucene with added fields is worse than using Lucene with just a large bag-of-words approach.
- Conclusions:
- Don't use the Lucene ranking function
- There is no good ranking model for semantic search, BM25F is probably the best
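
To make the saturation point concrete, here is a minimal BM25F-style scorer. It is a sketch under assumptions rather than the authors' implementation: the field names, field weights, `k1`, `b`, the IDF variant, and the toy documents are all made up for illustration. The key step is that per-field term frequencies are weighted and summed first, and the saturation curve is applied once to the combined value.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, avg_field_len,
                doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one document whose fields are given as lists of tokens."""
    score = 0.0
    for term in query_terms:
        # Combine the length-normalized term frequency of every field.
        pseudo_tf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            norm = 1.0 - b + b * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturation is applied once, after the fields are combined.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical collection statistics and documents.
weights = {"label": 3.0, "text": 1.0}
avg_len = {"label": 2.0, "text": 50.0}
df = {"raleigh": 10}

once = {"label": [], "text": ["raleigh"] + ["x"] * 49}
many = {"label": [], "text": ["raleigh"] * 25 + ["x"] * 25}
in_label = {"label": ["raleigh", "nc"], "text": ["x"] * 50}

s_once = bm25f_score(["raleigh"], once, weights, avg_len, df, 1000)
s_many = bm25f_score(["raleigh"], many, weights, avg_len, df, 1000)
s_label = bm25f_score(["raleigh"], in_label, weights, avg_len, df, 1000)
print(s_once, s_many, s_label)
```

The document that repeats the term 25 times scores higher than the one that mentions it once, but nowhere near 25 times higher, and a single match in the more heavily weighted `label` field outranks a single match in `text`.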



Paper: Distributed Index for Semantic Search by Peter Mika
Presenter: Peter Mika
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/indexing

- Describe the process of building indices for semantic search using MapReduce. Comparing two RDF index structures
- IR relies on inverted indices
- MapReduce is a perfect model for building inverted indices
- Map creates the (term, {doc1}) pairs
- Reduce collects all the docs for the same term (term, {doc1, doc2})
- Skew is a known issue: reducers have uneven load (high-frequency terms)
- Sub-indices are merged afterwards (inexpensive)
- Implementation for building Lucene indices on Hadoop - Katta Project
- RDF has much richer structure (more expressive queries require more sophisticated indices)
- Differences in semantic search lit. as to what expressivity is required
- Pound et al., WWW 2010
- Users unlikely to type SPARQL queries
- Queries on property values are required in almost all cases
- Simple, cheap solution: post-fixing
- Append the name of the verb to the end of the entity
- Good: there is less skew
- Bad: Dictionary is number of unique terms (this explodes)
- Horizontal indexing: two fields (indices), one for terms, one for properties
- Good for dictionary, occurrences is number of tokens *2
- As much skew as in normal text indexing
- Vertical indexing: One field(index) per property
- Good for Dictionary, Occurrences is number of tokens, less skew
- Bad:
- More complex than the textbooks would like you to believe
- Need to hash docids : used MG4J Minimal perfect Hash (306MB for billions of docs)
- Posting list needs to be sorted by docid
- Used the BTC 2009 data set
- Horizontal index structure is more efficient for keyword queries and field restricts
- Indexing costs:
- Number of reducers can be chosen based on the trade-offs of too many reducers or too many mappers
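
The map and reduce steps described above can be imitated in plain Python. This is a single-process sketch, not Hadoop code and not the implementation from the talk: the function names and the toy documents are assumptions, and keying postings by (property, term) is only loosely modeled on the "one index per property" layout.

```python
from collections import defaultdict


def map_phase(doc_id, fields):
    """Emit ((property, term), doc_id) pairs for one document."""
    for prop, text in fields.items():
        for term in set(text.lower().split()):
            yield (prop, term), doc_id


def reduce_phase(key, doc_ids):
    """Collect all documents for one key into a posting list sorted by doc id."""
    return key, sorted(set(doc_ids))


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for doc_id, fields in docs.items():
        for key, value in map_phase(doc_id, fields):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical entity descriptions: property -> literal value.
docs = {
    1: {"name": "alice smith", "city": "raleigh"},
    2: {"name": "alice jones", "city": "durham"},
    3: {"name": "raleigh carter", "city": "durham"},
}

index = build_index(docs)
print(index[("name", "alice")])    # postings for "alice" in the name property
print(index[("city", "raleigh")])  # "raleigh" as a city, not as a name
```

Because the property is part of the key, a query restricted to `city` never touches the posting list for `raleigh` as a name. In a real cluster the `grouped` dictionary is replaced by the framework's shuffle, which is where the skew from high-frequency terms shows up.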



Paper: Dear Search Engine: What’s your opinion about...? - Sentiment Analysis for Semantic Enrichment of Web Search Results  by Gianluca Demartini and Stefan Siersdorfer
Presenter: Gianluca Demartini
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/dear

- Controversial topics are being discussed on the web
- Search engine can bias (if they wanted) the way we see the web
- It is important to provide a good overview of top-N results for both topic and sentiment
- Contribution: approaches for computing sentiment of web pages
- Trained a classifier on a movie review data set
- So what is an ideal ranking of the sentiment for controversial search results
- Possible results:
- Balanced Overview (+1 and -1 docs)
- Neutral Overview (0 docs)
- Realistic Overview (80% docs)
- Personalized Overview (use user profile)
- Extraction of sentiment classification of web pages
- Use a lexicon of sentiment words (SentiWordNet)
- Compared 14 different queries on 3 search engines top-1 to 50 results...
- Average sentiment is very close to zero for every search engine.
- "The first result is usually favorable for a topic"
- "Average sentiment about employment is greater than the average sentiment of marijuana"
- Quality of extraction/annotations was not studied
- Still to determine: are the top-N results a good sample?
- Several interesting applications
- A better training set could be the TREC dataset



Paper: Automatic Modeling of User's Real World Activities from the Web for Semantic IR by Yusuke Fukazawa and Jun Ota
Presenter:
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/modelling.pdf

- Interested in an infrastructural mobile semantic IR
- Investigated the automatic modeling of users' real world activities from the web. (Based on the movies)
- They try to automatically model a user's real-world activity (based on Twitter or blogs)
- Previous work did not extract hierarchical information from the web.
- There is an ontological structure consisting of a Domain, Parent Task, and Child Tasks
- For example, what is the child task "Watch movie" vs. "Make movie"
- Apply idea of PMI-IR to solve this
- PMI-IR(A, B) = hits(A and B) / (hits(A) * hits(B)) -- hits is the number of search engine hits
- It is difficult to produce a measure between the same tasks and different. We want mostly exact matches between same and different tasks.
- Method 3 in the paper was able to acquire 80% of hierarchical relationships
- This needs to be tested on alternative domains.
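
Reading the PMI-IR score as the joint hit count divided by the product of the individual hit counts, a small worked example looks like this. The hit counts are invented, and the helper names are assumptions; a real system would obtain the counts from a search engine.

```python
def pmi_ir(hits_both, hits_a, hits_b):
    """Joint hit count divided by the product of the individual hit counts."""
    if hits_a == 0 or hits_b == 0:
        return 0.0
    return hits_both / (hits_a * hits_b)


def rank_child_tasks(parent, candidates, hits, joint_hits):
    """Order candidate child tasks by their PMI-IR score with the parent."""
    scored = [
        (pmi_ir(joint_hits.get((c, parent), 0), hits[c], hits[parent]), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, reverse=True)]


# Made-up hit counts for illustration only.
hits = {"movie": 2000, "watch": 1000, "make": 500}
joint_hits = {("watch", "movie"): 400, ("make", "movie"): 50}

ranking = rank_child_tasks("movie", ["make", "watch"], hits, joint_hits)
print(ranking)
```

With these numbers "watch" scores 400 / (1000 * 2000) and "make" scores 50 / (500 * 2000), so "watch" is ranked as the more strongly associated task.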


Paper: The Wisdom in Tweetonomies: Acquiring Latent Conceptual Structures from Social Awareness Streams by Claudia Wagner and Markus Strohmaier
Presenter: Claudia Wagner
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/tweet.pdf

- Social awareness streams (SAS) - aggregation of short natural language messages by users
- Tweetonomies - the latent structures that appear from many users
- The name comes from taxonomy and folksonomies
- Do tweetonomies exist?
- Structure of SAS
- Users, messages, content of messages
- content: URLS, hashtags, etc.
- Information from App developers
- SAS model (see paper)
- User nodes, message nodes, resource nodes (all with qualifiers)
- Relationship between message content and users
- Use stream measures and network transformations
- Data set was from twitter stream aggregation during certain times.
- See paper for several Structural Stream measures.  They are models to understand streams
- Results
- Different types of stream aggregations influence stream structures
- hash-tag streams are more robust against external disturbances than user-list streams
- hash tags are good context indicators
- Resource-hashtag networks reveal good latent conceptual structures
- Conclusions
 - The emergence of tweetonomies is dependent on the type of networks modeled



Paper: A Large-Scale System for Annotating and Querying Quotations in News Feeds by Jisheng Liang, Navdeep Dhillon and Krzysztof Koperski
Presenter: Jisheng Liang
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/newsfeeds.pdf

- Try to extract annotations in news/blog feeds to get semantic information.
- Indexing these annotations
- Build Efficient, query-able, search for this
- Quotation Search
- What did [speaker] say about [subject]?
- Speaker/Subject can be specified as:
- Specific entity, Facet - category ...
- Document processing --> Clause Processing --> Normalization and Annotation
- Indexed triples (SVO triples) Not currently RDF triples
- The triples have links to the entity store
- There are currently 2 million entities (people, places, organizations ...)
- Sources: Wikipedia and structured/semi-structured sources (CrunchBase, Amazon, ...)
- This entity store is being expanded continuously
- Automatic detection and import of new entities (e.g. Susan Boyle)
- Need live updates for entities (e.g. a sports player being traded)
- Entity Properties
- Unique identifier
- type description, synonyms and aliases, type and facet specific properties
- relation properties (e.g. a basketball player is linked to their team/league)
- Quote Extraction
- detect verbs, verify verb subject is a person, check quotation marks (could be multiple sentences)
- Coreference Resolutions --> Pronouns e.g. He said, Aliases (said Gates) etc...
- Entity disambiguation
- Identify speaker, mark up boundary of quote
- Quote is indexed
- Triple: Speaker, Verb, Quote
- Each triple is a Lucene document
- combine lucene + semantic Annotation
- keywords also stored for search
- Example facets (President, country leader)
- Accessible using REST APIs -- evri.com
- 10 million quotes
- 60K quotes added each day
- .5 Billion SVO triples
- Query Execution Avg 159ms, median 59ms
- Future Work
- They plan on improving relevance ranking
- Identify Pull quote "The article the editor pulled out"
- Sentiment analysis.
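
As a rough picture of the (Speaker, Verb, Quote) triples described above, here is a regular-expression sketch. It is far simpler than the system in the talk: it only handles the pattern of a quoted passage followed by a speech verb and a capitalized name, and it does no coreference resolution or entity disambiguation. The verb list, the pattern, and the sample sentence are all made up for illustration.

```python
import re

# Hypothetical list of speech verbs to look for after a closing quote.
SPEECH_VERBS = ("said", "says", "told", "added")

QUOTE_PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\s*,?\s*'
    r"(?P<verb>" + "|".join(SPEECH_VERBS) + r")\s+"
    r"(?P<speaker>[A-Z]\w*(?:\s+[A-Z]\w*)*)"
)


def extract_quotes(text):
    """Return (speaker, verb, quote) triples found in the text."""
    return [
        (m.group("speaker"), m.group("verb"), m.group("quote").rstrip(",. "))
        for m in QUOTE_PATTERN.finditer(text)
    ]


# Invented sentence with a placeholder speaker.
text = '"Search is far from solved," said Jane Doe. The workshop ran all day.'
triples = extract_quotes(text)
print(triples)
```

Each extracted triple could then be stored as one indexed document, with the speaker resolved against an entity store in a later step.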



Paper: Semantically Enabled Exploratory Video Search by Joerg Waitelonis, Harald Sack, Zalan Kramer and Johannes Hercher
Presenter: Joerg Waitelonis
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/video.pdf

- Keyword based search is a problem because you should always know what you are looking for
- Doc similarity based on the syntax level
- Exploratory search
- User is not familiar with the domain
- not sure how to reach search destination
- not really sure about what she is looking for
- Users find results accidentally - Serendipitous findings
- ex: Presidents are linked to a 'battle'
- Created video search engine called yovisto
- use DBpedia, foaf, dublincore, mpeg-7, tagging
- After search, you can get a preview of particular parts of the video
- search terms are mapped to dbpedia entries
- There is ranking/filtering for related entities, heuristics
- rdf-graph-based
- more relevant if the graph structures of two entities are similar
- statistical/linguistic-based
- There are still many questions to be answered, not perfect for all users at all time
- Quality-based Eval
- 19 persons, 124 search tasks, 1489 search queries
- 699 without/790 with exploratory search feature
- Surveys show that users feel motivated to find the actual answer when using exploratory tools. (97% vs 82%)


Paper: Entity Search: Building Bridges between Two Worlds by Krisztian Balog, Edgar Meij and Maarten de Rijke
Presenter: Krisztian Balog
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/entity.pdf

- This is a position paper
- Entity search is important
- Being looked at by both IR and SW community
- Entity search tasks
- Entity ranking
- List completion
- Related entity finding
- Where are we now?
- IR
- Identifying and ranking entities in large volumes of data
- Mostly based on co-occurrences between terms and entities
- Generated models are not always meaningful for human consumption
- SW
- Structured data, naturally organized around entities
- Entity retrieval is as simple as running SPARQL queries?
- Free-text querying is more appealing to (naive) end user
- Related entity finding
- Given: input entity (E), Type (T), narrative(R)
- E = name  + homepage
- T = product, organization,
- R = relationship between product
- AIM: compare IR and SW approaches on the related entity finding tasks
- TREC Entity 2009 maps the source entity to a Wikipedia page (17/20)
- Task is to capture all common entities
- IR approach
- Query, document snippet retrieval, answer candidate extraction, answer candidate [type] filtering, answer candidate ranking, Answer
- SW approaches
- Use sparql query for entity then an exhaustive graph search.
- Most relations returned by SW methods are of type wikilinks
- Summary
- IR is useful, needs labels
- SW has the capability to produce large amounts of data, but LOD is very sparse between entities
- Enhance text-based models with semantic information from LOD
- Use IR models to discover and label links between



Paper: Methodology and Campaign Design for the Evaluation of Semantic Search Tools by Stuart Wrigley, Dorothee Reinhard, Khadija Elbedweihy, Abraham Bernstein and Fabio Ciravegna
Presenter: Stuart Wrigley
URL: http://km.aifb.kit.edu/ws/semsearch10/Files/eva.pdf

- Idea of SEALS is to develop and diffuse best practices in evaluation of semantic technologies
- Create a lasting reference infrastructure for semantic technology evaluation (SEALS platform)
- Goal: Organize two worldwide Evaluation Campaigns
- Engineering tools, storage and reasoning, ...
- You can use both their hardware and previous stuff
- Eval criteria
- Query Expressiveness
- Usability (effectiveness, efficiency, satisfaction)
- Scalability
- Quality of documentation
- Raw performance
- There is an Automated evaluation phase and a user in the loop phase
- User in the loop phase uses the Mooney Natural Language Learning Data
- EvOnto is a set of ontologies of 5 different data sizes (1K, 10K, ....10M)
- Questionnaire
- System Usability Scale - how you like the system (Likert scale)
- For automated version, they want to see system accuracy and performance
- User in the loop phase - has several result data sets

Wednesday, April 7, 2010

UF CISE Email is Leveling the Playing Field

There is usually a huge advantage to obtaining solutions to homework questions from previous years. This is one advantage of joining fraternities and other social circles.  While students outside of these groups struggle to complete assignments, the people with the previous answer sets can simply study the solutions. This puts a large body of students at a disadvantage. I know there are similar practices in medical and dental schools.

Today, I was proud to receive an email from our department chair that encourages professors to post solutions from previous semesters to make the playing field a little more level.

Several years ago, the CISE faculty agreed to post all assigned coursework (homeworks, exams, solutions, etc) on course Web sites so as to level the playing field for our students. You will recall that we were getting several complaints about some students having access to past course material gathered by their friends/fraternities/sororities while other students were denied such access. Unfortunately, many of us have not implemented this decision. Reuse of past assignment questions, projects, exam question etc. encourages cheating and gives some students an unfair advantage over other students.
I urge everyone to implement the decision made by the CISE faculty several years ago to post all course work along with solutions. We need to level the playing field for our students. At the very least, please do not reuse assessment instruments.

This is a step in the right direction to preserve the quality of graduating students. The reuse of homework/exams is a likely reason Jeff Atwood says so many programmers can't program.

Wednesday, March 17, 2010

Google Chrome on Linux causing crashes

If you are like me and use Google Chrome on Ubuntu 9.10 or some other Debian-based Linux, you have been experiencing errors when updating for a while.

When you run the update you may get:
WARNING: The following packages cannot be authenticated! packagename
Another symptom may be that the update hangs at 99% "waiting for headers".

This is because apt-get is trying to validate your installed repository.

There is supposed to be a fix for this, here: http://www.google.com/linuxrepositories/apt.html

It works after hanging for a while.