WikiSearch

=== Programs === WikiSearchMain, WikiXMLParser, Retriever, Stemmer, WikiPageIndexer

=== Author === Gagandeep Chhabra

=== Description === The programs here are used to create the index on the input XML file (Wikipedia XML Dump) and then search on the corpus. Basically its an offline serach tool for wikipedia XML dump.

=== Technology === Java

=== Explanation Of Implementation ===

To Reduce the size of the Index the Page Ids and pageLengths are stored in Hexadecimal format(considerable amount of space is saved).
Also three levels of Index is created. Primary Index has entire data, Secondary Index has Interleaved data from Primary and finally Tertiary Index has interleaved data from Secondary Index.
When parsing a page from Corpus, a Forward Index is created. Forward Index has page Id, corresponding page length and the title of it. This is useful when normalizing the weight in retrieval.
Also static precedence has been given to fields like TITLE is given max value of 5, content of Info box has weight of 4 and so on.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
src		src
README.md		README.md
stopWordList		stopWordList

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WikiSearch

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

WikiSearch

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages