This repository was archived by the owner on Dec 21, 2017. It is now read-only.
## Changes from all commits (42 commits)
- `b22d525` prfs with RDD and meta perform metrics (bethke, Jan 21, 2016)
- `84afe8d` performance dict update (bethke, Jan 22, 2016)
- `e0c9b95` save-load to hadoop file system (bethke, Jan 22, 2016)
- `080777b` minor bug fix (bethke, Jan 23, 2016)
- `b47cb09` OSM to JSON fixes (bethke, Jan 26, 2016)
- `53ae05c` initial osm data vectorizer (bethke, Jan 26, 2016)
- `b77460d` coverage calc modifications (bethke, Jan 27, 2016)
- `3b0e7c4` Merge pull request #1 from Lab41/master (Jan 27, 2016)
- `73128a9` pass in sqlCtx for performance metrics (bethke, Jan 27, 2016)
- `1853f2f` Merge pull request #27 from bethke/master (Jan 27, 2016)
- `9625d0a` Fix a bug with non-ascii characters in names (Jan 28, 2016)
- `3b8bb04` Merge pull request #28 from agude/git_to_json_fix (agude, Jan 28, 2016)
- `950ee23` implemented framework but with a hardcoded recommender and metric (Dec 2, 2015)
- `871242b` implemented a working base framework (Dec 10, 2015)
- `e2ffd6b` complete READMEs (Dec 17, 2015)
- `7d06631` fix rebase with bookcrossing (Jan 4, 2016)
- `d41849f` revise to raise NotImplemented error for functions not yet implemente… (Jan 4, 2016)
- `6e9bdf6` fix logging for INFO that prints twice (Jan 4, 2016)
- `d33026d` change vectorizer to dataname to be more clear (Jan 4, 2016)
- `55ebd99` rename genre_dataname() to get_genre() for clarity (Jan 5, 2016)
- `d9eb4e1` add more glossary terms (Jan 5, 2016)
- `2cc02ee` separate vector generation of individual datasets in separate files (Jan 6, 2016)
- `3df20a8` document assumptions (Jan 7, 2016)
- `6257af9` separate recommender generation of different use cases in separate files (Jan 7, 2016)
- `67e8cd1` rename files to provide more clarity (Jan 7, 2016)
- `19e3c54` config differentiates recommenders based on the vector types (Jan 7, 2016)
- `e683467` import submodules in __init__.py (Jan 12, 2016)
- `2dd2421` remove sqlCtx in vectorgenerator because sqlCtx is a global variable (Jan 12, 2016)
- `6300ad5` implement so that data class does not know about vector classes (Jan 12, 2016)
- `9559c62` Data class does not need to know about the SparkContext (Jan 12, 2016)
- `99be88a` add submodules in modules's __init__.py (Jan 12, 2016)
- `58d2394` load module from zip file because in notebook, hermes is zipped up as a (Jan 12, 2016)
- `08922ef` add assumptions about load_modules_in_zip() (Jan 13, 2016)
- `785b1f3` use cf.py in recommendergenerator (Jan 13, 2016)
- `79f8384` convert metrics directory to algorithms directory for consistency (Jan 13, 2016)
- `3bf5cd6` wip: add uservector and contentvector in recommenders (Jan 14, 2016)
- `72a2a29` wip: integrate framework into an iPython notebook including instructi… (Jan 25, 2016)
- `f60cd15` Update using_notebook.md (Jan 25, 2016)
- `7e33cb1` Update installation.md (Jan 25, 2016)
- `cd193ba` Update using_notebook.md (Jan 25, 2016)
- `cdc2e65` Update using_notebook.md (Jan 28, 2016)
- `dad25db` Update README.md (Mar 16, 2016)
### .gitignore (2 changes: 1 addition & 1 deletion)

```diff
@@ -112,7 +112,7 @@ lib64
 __pycache__

 # The __init__.py's that scram puts everywhere
-__init__.py
+# __init__.py

 # Installer logs
 pip-log.txt
```
### README.md (76 changes: 73 additions & 3 deletions)

The updated README reads:
# Hermes

Hermes is Lab41's foray into recommender systems. It explores how to choose a recommender system for a new application by analyzing the performance of multiple recommender system algorithms on a variety of datasets.

It also explores how recommender systems may assist a software developer or data scientist in finding new data, tools, and computer programs.

This README will be updated as the project progresses, so stay tuned!


## Documentation

[Hermes Documentation](https://github.com/Lab41/hermes/tree/master/docs)


## Basic Installation Guide

For a detailed installation guide, please read the [Hermes Installation Guide](https://github.com/Lab41/hermes/tree/master/docs/installation.md).

### Dependencies:
* Spark 1.5.1
* Scala 2.11.7
* Pyspark 0.8.2.1
* Hadoop 2.7.1
* virtualenv

### Warning:
We have stopped developing the command-line version of Hermes because the team has decided to pursue running Hermes in Spark's iPython Notebook instead.

### How to Install Hermes:

(Optional) After you have installed the dependencies, if you have different projects that require different Python environments, you can use a virtual environment. As described on the Virtual Environments [site](http://docs.python-guide.org/en/latest/dev/virtualenvs/), "a Virtual Environment is a tool to keep the dependencies required by different projects in separate places, by creating virtual Python environments for them."

```bash
$ virtualenv name_of_your_virtualenv
$ . name_of_your_virtualenv/bin/activate
```

To install Hermes, run
```bash
$ python setup.py install
```

This will create a binary called `hermes` in `/usr/local/bin`. Instead of running the binary with its full path (i.e., `/usr/local/bin/hermes`), you can install it in editable mode so that you can run `hermes` on the command line without the full path.
```bash
$ pip install --editable .
```

Now you can simply run the `hermes` binary, and it will prompt you for what you want to do with your data.
```bash
$ hermes
```

## How to Run Hermes

NOTE: The next implementation of Hermes will be set up so that it does not use pseudo-distributed mode on a single-node cluster.

For a detailed guide on how to run Hermes, please read the [How to Run Hermes](https://github.com/Lab41/hermes/tree/master/docs/run.md) guide.

Hermes requires at least three arguments in order to run properly:
* fs_default_ip_addr: the IP address of fs.default.name used by HDFS, e.g., localhost:9000.
* list_of_files_config: a configuration file that lists all the JSON paths referenced by the configs.
* configs: any number of configuration files that list which datasets to use and which recommender algorithms and metrics to apply to each dataset.

With one configuration file:
```bash
$ hermes localhost:9000 ./hermes/configs/list_of_files.ini ./hermes/configs/config1.ini
```

With more than one configuration file:
```bash
$ hermes localhost:9000 ./hermes/configs/list_of_files.ini ./hermes/configs/config1.ini ./hermes/configs/config2.ini
```

## State of Build

The build is currently in progress. We will show its status using TravisCI once it is established.
### docs/assumptions.md (70 changes: 70 additions & 0 deletions)
# Assumptions

* [Assumptions on Execution](#assumptions-on-execution)
* [Assumptions on Vector Creation](#assumptions-on-vector-creation)
* [Assumptions on Directory Creation](#assumptions-on-directory-creation)

## Assumptions on Execution

Here is an example file called config.ini.

```ini
[datasets]
dataname = movielens

# user vector
user_vector_data = ["movielens_10m_ratings", "movielens_20m_ratings"]
user_vector_schemas = ["movielens_10m_ratings_schema", "movielens_20m_ratings_schema"]
user_vector_transformations = ["ratings", "ratings_to_interact"]

# content vector
content_vector_data = ["movielens_10m_movies"]
content_vector_schema = ["movielens_10m_movies_schema"]
content_vector_transformations = ["genre"]

[recommenders]
user_recommenders = ["ALS"]
content_recommenders = ["CBWithKMeans"]

[metrics]
metrics = ["RMSE", "MAE"]
```

Given this configuration, we make the following assumptions during execution (a code sketch of the pairing logic follows the list):
* Each transformation is applied to the data positionally, in sequential order, meaning:
  * user_vector_transformation "ratings" is applied to "movielens_10m_ratings" and "movielens_10m_ratings_schema"
  * user_vector_transformation "ratings_to_interact" is applied to "movielens_20m_ratings" and "movielens_20m_ratings_schema"
  * content_vector_transformation "genre" is applied to "movielens_10m_movies" and "movielens_10m_movies_schema"
* user_recommenders takes a list of recommender algorithms that will be applied to all user_vector_data, meaning:
  * apply ALS to a User Vector of movielens_10m_ratings that has been transformed by the vector transformation "ratings"
  * apply ALS to a User Vector of movielens_20m_ratings that has been transformed by the vector transformation "ratings_to_interact"
* content_recommenders takes a list of recommender algorithms that will be applied to all content_vector_data, meaning:
  * apply CBWithKMeans to a Content Vector of movielens_10m_movies that has been transformed by the vector transformation "genre"
* metrics takes a list of metrics that will be applied to all data, both user_vector_data and content_vector_data, after the recommender algorithms have been applied, meaning:
  * apply RMSE to a User Vector of movielens_10m_ratings transformed by "ratings" and the recommender algorithm ALS
  * apply RMSE to a User Vector of movielens_20m_ratings transformed by "ratings_to_interact" and the recommender algorithm ALS
  * apply RMSE to a Content Vector of movielens_10m_movies transformed by "genre" and the recommender algorithm CBWithKMeans
  * apply MAE to a User Vector of movielens_10m_ratings transformed by "ratings" and the recommender algorithm ALS
  * apply MAE to a User Vector of movielens_20m_ratings transformed by "ratings_to_interact" and the recommender algorithm ALS
  * apply MAE to a Content Vector of movielens_10m_movies transformed by "genre" and the recommender algorithm CBWithKMeans
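
To make these pairings concrete, here is a minimal, hypothetical Python sketch (not the actual Hermes code) that reads config.ini and prints every metric/recommender/vector combination for the user vectors; the variable names are illustrative:

```python
import configparser
import json
from itertools import product

# Parse config.ini; json.loads turns the bracketed values into Python lists.
config = configparser.ConfigParser()
config.read("config.ini")

datasets = config["datasets"]
user_data = json.loads(datasets["user_vector_data"])
user_schemas = json.loads(datasets["user_vector_schemas"])
user_transforms = json.loads(datasets["user_vector_transformations"])

# Assumption: transformations pair with datasets and schemas positionally.
user_vectors = list(zip(user_data, user_schemas, user_transforms))

recommenders = json.loads(config["recommenders"]["user_recommenders"])
metrics = json.loads(config["metrics"]["metrics"])

# Assumption: every recommender runs on every vector, and every metric
# scores every (vector, recommender) result.
for (data, schema, transform), rec, metric in product(user_vectors, recommenders, metrics):
    print("%s <- %s <- %s(%s, %s)" % (metric, rec, transform, data, schema))
```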

## Assumptions on Vector Creation

Each dataset is unique: transforming its JSON into an RDD differs from dataset to dataset. This step is implemented in vectorgenerator.py. Because the vector generation for each dataset lives in its own file in the hermes/hermes/modules/vectors directory, each of these files needs to import vectorgenerator.py in this specific manner:

```python
from hermes.modules.vectorgenerator import UserVector, ContentVector
```

The reason lies in how the VectorFactory class instantiates vector objects. The vector to create is either a UserVector or a ContentVector; both classes are defined in vectorgenerator.py, whose module path is hermes.modules.vectorgenerator.

Since the children of the UserVector and ContentVector classes are now defined in separate modules in the hermes/hermes/modules/vectors directory, we can no longer use the __subclasses__() function to iterate through them in order to instantiate the right vector. Instead, we have to load all the modules and inspect each class in each module to find the children of UserVector or ContentVector. Unfortunately, if a file imports "from modules.vectorgenerator" instead of "from hermes.modules.vectorgenerator", Python does not consider the two modules the same even though they are.

We have yet to determine why this is the case.

When users add a new dataset, we cannot assume that they will always import exactly "from hermes.modules.vectorgenerator import UserVector, ContentVector"; they may instead import "from modules.vectorgenerator import UserVector, ContentVector", which is also valid. For this reason, we assume that if the parent class of, for example, MovieLensUserVector has the __name__ UserVector, then MovieLensUserVector is a child of UserVector. The problem with this assumption is that if MovieLensUserVector inherits from multiple parents in different modules that share the same class name, both parents will be treated as the same class.
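
A minimal sketch of this name-based check, under the assumptions above (the helper name `find_vector_class` is illustrative, not the actual Hermes API):

```python
import inspect

def find_vector_class(module, parent_name, dataname):
    """Find the class in `module` whose parent class is *named* `parent_name`.

    Base-class names are compared instead of calling issubclass() because
    hermes.modules.vectorgenerator.UserVector and
    modules.vectorgenerator.UserVector are distinct class objects to Python,
    even though they come from the same file.
    """
    for _, cls in inspect.getmembers(module, inspect.isclass):
        # Known weakness: two unrelated parents that happen to share the
        # name `parent_name` would both match this check.
        if any(base.__name__ == parent_name for base in cls.__bases__):
            if cls.__name__.lower().startswith(dataname.lower()):
                return cls
    return None

# Hypothetical usage: given a loaded MovieLens vector module,
# find_vector_class(movielens_module, "UserVector", "movielens")
# would return MovieLensUserVector.
```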


## Assumptions on Directory Creation

We assume that there is only one directory each with the label "vg", "rg", and "mg". These directories store the modules for vector, recommender, and metric creation specific to particular datasets or use cases. This assumption is made in the helper function load_modules_in_zip(), which checks whether the base directory of a file path is "vg", "rg", or "mg" in order to load the corresponding modules in the notebook during vector, recommender, or metric creation.
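
A minimal sketch of this check, assuming the zip layout described above (the helper name `modules_in_zip_by_kind` is illustrative; the actual load_modules_in_zip() may differ):

```python
import posixpath
import zipfile

def modules_in_zip_by_kind(zip_path, kind):
    """Yield .py members of the zip whose immediate parent directory is
    `kind`: "vg" (vectors), "rg" (recommenders), or "mg" (metrics)."""
    assert kind in ("vg", "rg", "mg")
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Zip member names always use forward slashes, so posixpath is
            # the right tool. Only the base directory of the path is checked,
            # which is why there must be exactly one "vg", "rg", and "mg"
            # directory in the archive.
            if name.endswith(".py") and posixpath.basename(posixpath.dirname(name)) == kind:
                yield name

# Hypothetical usage:
# for module_path in modules_in_zip_by_kind("hermes.zip", "vg"):
#     print(module_path)
```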