This repository was archived by the owner on Dec 21, 2017. It is now read-only.
## Changes from all commits (42 commits)
- `b22d525` prfs with RDD and meta perform metrics (bethke, Jan 21, 2016)
- `84afe8d` performance dict update (bethke, Jan 22, 2016)
- `e0c9b95` save-load to hadoop file system (bethke, Jan 22, 2016)
- `080777b` minor bug fix (bethke, Jan 23, 2016)
- `b47cb09` OSM to JSON fixes (bethke, Jan 26, 2016)
- `53ae05c` initial osm data vectorizer (bethke, Jan 26, 2016)
- `b77460d` coverage calc modifications (bethke, Jan 27, 2016)
- `3b0e7c4` Merge pull request #1 from Lab41/master (Jan 27, 2016)
- `73128a9` pass in sqlCtx for performance metrics (bethke, Jan 27, 2016)
- `1853f2f` Merge pull request #27 from bethke/master (Jan 27, 2016)
- `9625d0a` Fix a bug with non-ascii characters in names (Jan 28, 2016)
- `3b8bb04` Merge pull request #28 from agude/git_to_json_fix (agude, Jan 28, 2016)
- `950ee23` implemented framework but with a hardcoded recommender and metric (Dec 2, 2015)
- `871242b` implemented a working base framework (Dec 10, 2015)
- `e2ffd6b` complete READMEs (Dec 17, 2015)
- `7d06631` fix rebase with bookcrossing (Jan 4, 2016)
- `d41849f` revise to raise NotImplemented error for functions not yet implemente… (Jan 4, 2016)
- `6e9bdf6` fix logging for INFO that prints twice (Jan 4, 2016)
- `d33026d` change vectorizer to dataname to be more clear (Jan 4, 2016)
- `55ebd99` rename genre_dataname() to get_genre() for clarity (Jan 5, 2016)
- `d9eb4e1` add more glossary terms (Jan 5, 2016)
- `2cc02ee` separate vector generation of individual datasets in separate files (Jan 6, 2016)
- `3df20a8` document assumptions (Jan 7, 2016)
- `6257af9` separate recommender generation of different use cases in separate files (Jan 7, 2016)
- `67e8cd1` rename files to provide more clarity (Jan 7, 2016)
- `19e3c54` config differentiates recommenders based on the vector types (Jan 7, 2016)
- `e683467` import submodules in __init__.py (Jan 12, 2016)
- `2dd2421` remove sqlCtx in vectorgenerator because sqlCtx is a global variable (Jan 12, 2016)
- `6300ad5` implement so that data class does not know about vector classes (Jan 12, 2016)
- `9559c62` Data class does not need to know about the SparkContext (Jan 12, 2016)
- `99be88a` add submodules in modules's __init__.py (Jan 12, 2016)
- `58d2394` load module from zip file because in notebook, hermes is zipped up as a (Jan 12, 2016)
- `08922ef` add assumptions about load_modules_in_zip() (Jan 13, 2016)
- `785b1f3` use cf.py in recommendergenerator (Jan 13, 2016)
- `79f8384` convert metrics directory to algorithms directory for consistency (Jan 13, 2016)
- `3bf5cd6` wip: add uservector and contentvector in recommenders (Jan 14, 2016)
- `72a2a29` wip: integrate framework into an iPython notebook including instructi… (Jan 25, 2016)
- `f60cd15` Update using_notebook.md (Jan 25, 2016)
- `7e33cb1` Update installation.md (Jan 25, 2016)
- `cd193ba` Update using_notebook.md (Jan 25, 2016)
- `cdc2e65` Update using_notebook.md (Jan 28, 2016)
- `dad25db` Update README.md (Mar 16, 2016)
### .gitignore (2 changes: 1 addition & 1 deletion)

```diff
@@ -112,7 +112,7 @@ lib64
 __pycache__

 # The __init__.py's that scram puts everywhere
-__init__.py
+# __init__.py

 # Installer logs
 pip-log.txt
```
### README.md (76 changes: 73 additions & 3 deletions)

The updated README reads:
# Hermes

Hermes is Lab41's foray into recommender systems. It explores how to choose a recommender system for a new application by analyzing the performance of multiple recommender system algorithms on a variety of datasets.

It also explores how recommender systems may assist a software developer or data scientist in finding new data, tools, and computer programs.

This README will be updated as the project progresses, so stay tuned!


## Documentation

[Hermes Documentation](https://github.com/Lab41/hermes/tree/master/docs)


## Basic Installation Guide

For a detailed installation guide, please read the [Hermes Installation Guide](https://github.com/Lab41/hermes/tree/master/docs/installation.md).

### Dependencies:
* Spark 1.5.1
* Scala 2.11.7
* Pyspark 0.8.2.1
* Hadoop 2.7.1
* virtualenv

### Warning:
We have stopped developing the command-line version of Hermes because the team has decided to pursue running Hermes in Spark's iPython Notebook instead.

### How to Install Hermes:

(Optional) After you have installed the dependencies, if you have different projects that require different Python environments, you can use a virtual environment. As described on the Virtual Environments [site](http://docs.python-guide.org/en/latest/dev/virtualenvs/), "a Virtual Environment is a tool to keep the dependencies required by different projects in separate places, by creating virtual Python environments for them."

```bash
$ virtualenv name_of_your_virtualenv
$ . name_of_your_virtualenv/bin/activate
```

To install Hermes, run
```bash
$ python setup.py install
```

This will create a binary called `hermes` in `/usr/local/bin`. Instead of running the binary with its full path (i.e., `/usr/local/bin/hermes`), you can install it in editable mode so that you can run `hermes` on the command line without the full path.
```bash
$ pip install --editable .
```

Now you can simply run the `hermes` binary, and it will prompt you for what you want to do with your data.
```bash
$ hermes
```

## How to Run Hermes

NOTE: The next implementation of Hermes will be set up so that it does not use pseudo-distributed mode on a single-node cluster.

For a detailed guide on how to run Hermes, please read the [How to Run Hermes](https://github.com/Lab41/hermes/tree/master/docs/run.md) guide.

Hermes requires at least three arguments in order to run properly:
* fs_default_ip_addr: the IP address of fs.default.name used by HDFS, e.g., localhost:9000.
* list_of_files_config: a configuration file that lists all the JSON paths referenced by the configs.
* configs: any number of configuration files that list which datasets to use and which recommender algorithms and metrics to apply to each dataset.

With one configuration file:
```bash
$ hermes localhost:9000 ./hermes/configs/list_of_files.ini ./hermes/configs/config1.ini
```

With more than one configuration file:
```bash
$ hermes localhost:9000 ./hermes/configs/list_of_files.ini ./hermes/configs/config1.ini ./hermes/configs/config2.ini
```

## State of Build

The build is currently in progress. We will show its status using TravisCI once it is established.
### docs/assumptions.md (70 changes: 70 additions & 0 deletions)
# Assumptions

* [Assumptions on Execution](#assumptions-on-execution)
* [Assumptions on Vector Creation](#assumptions-on-vector-creation)
* [Assumptions on Directory Creation](#assumptions-on-directory-creation)

## Assumptions on Execution

Here is an example file called config.ini.

```ini
[datasets]
dataname = movielens

# user vector
user_vector_data = ["movielens_10m_ratings", "movielens_20m_ratings"]
user_vector_schemas = ["movielens_10m_ratings_schema", "movielens_20m_ratings_schema"]
user_vector_transformations = ["ratings", "ratings_to_interact"]

# content vector
content_vector_data = ["movielens_10m_movies"]
content_vector_schema = ["movielens_10m_movies_schema"]
content_vector_transformations = ["genre"]

[recommenders]
user_recommenders = ["ALS"]
content_recommenders = ["CBWithKMeans"]

[metrics]
metrics = ["RMSE", "MAE"]
```

Given this configuration, we make the following assumptions during execution (a code sketch of the pairing logic follows the list):
* Each transformation is applied to the data positionally, in sequential order, meaning:
  * user_vector_transformation "ratings" is applied to "movielens_10m_ratings" and "movielens_10m_ratings_schema"
  * user_vector_transformation "ratings_to_interact" is applied to "movielens_20m_ratings" and "movielens_20m_ratings_schema"
  * content_vector_transformation "genre" is applied to "movielens_10m_movies" and "movielens_10m_movies_schema"
* user_recommenders takes a list of recommender algorithms that will be applied to all user_vector_data, meaning:
  * apply ALS to a User Vector of movielens_10m_ratings that has been transformed by the vector transformation "ratings"
  * apply ALS to a User Vector of movielens_20m_ratings that has been transformed by the vector transformation "ratings_to_interact"
* content_recommenders takes a list of recommender algorithms that will be applied to all content_vector_data, meaning:
  * apply CBWithKMeans to a Content Vector of movielens_10m_movies that has been transformed by the vector transformation "genre"
* metrics takes a list of metrics that will be applied to all data, both user_vector_data and content_vector_data, after the recommender algorithms have been applied, meaning:
  * apply RMSE to a User Vector of movielens_10m_ratings transformed by "ratings" and the recommender algorithm ALS
  * apply RMSE to a User Vector of movielens_20m_ratings transformed by "ratings_to_interact" and the recommender algorithm ALS
  * apply RMSE to a Content Vector of movielens_10m_movies transformed by "genre" and the recommender algorithm CBWithKMeans
  * apply MAE to a User Vector of movielens_10m_ratings transformed by "ratings" and the recommender algorithm ALS
  * apply MAE to a User Vector of movielens_20m_ratings transformed by "ratings_to_interact" and the recommender algorithm ALS
  * apply MAE to a Content Vector of movielens_10m_movies transformed by "genre" and the recommender algorithm CBWithKMeans
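
To make these pairings concrete, here is a minimal, hypothetical Python sketch (not the actual Hermes code) that reads config.ini and prints every metric/recommender/vector combination for the user vectors; the variable names are illustrative:

```python
import configparser
import json
from itertools import product

# Parse config.ini; json.loads turns the bracketed values into Python lists.
config = configparser.ConfigParser()
config.read("config.ini")

datasets = config["datasets"]
user_data = json.loads(datasets["user_vector_data"])
user_schemas = json.loads(datasets["user_vector_schemas"])
user_transforms = json.loads(datasets["user_vector_transformations"])

# Assumption: transformations pair with datasets and schemas positionally.
user_vectors = list(zip(user_data, user_schemas, user_transforms))

recommenders = json.loads(config["recommenders"]["user_recommenders"])
metrics = json.loads(config["metrics"]["metrics"])

# Assumption: every recommender runs on every vector, and every metric
# scores every (vector, recommender) result.
for (data, schema, transform), rec, metric in product(user_vectors, recommenders, metrics):
    print("%s <- %s <- %s(%s, %s)" % (metric, rec, transform, data, schema))
```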

## Assumptions on Vector Creation

Each dataset is unique: transforming its JSON into an RDD differs from dataset to dataset. This step is implemented in vectorgenerator.py. Because the vector generation for each dataset lives in its own file in the hermes/hermes/modules/vectors directory, each of these files needs to import vectorgenerator.py in this specific manner:

```python
from hermes.modules.vectorgenerator import UserVector, ContentVector
```

The reason lies in how the VectorFactory class instantiates vector objects. The vector to create is either a UserVector or a ContentVector; both classes are defined in vectorgenerator.py, whose module path is hermes.modules.vectorgenerator.

Since the children of the UserVector and ContentVector classes are now defined in separate modules in the hermes/hermes/modules/vectors directory, we can no longer use the __subclasses__() function to iterate through them in order to instantiate the right vector. Instead, we have to load all the modules and inspect each class in each module to find the children of UserVector or ContentVector. Unfortunately, if a file imports "from modules.vectorgenerator" instead of "from hermes.modules.vectorgenerator", Python does not consider the two modules the same even though they are.

We have yet to determine why this is the case.

When users add a new dataset, we cannot assume that they will always import exactly "from hermes.modules.vectorgenerator import UserVector, ContentVector"; they may instead import "from modules.vectorgenerator import UserVector, ContentVector", which is also valid. For this reason, we assume that if the parent class of, for example, MovieLensUserVector has the __name__ UserVector, then MovieLensUserVector is a child of UserVector. The problem with this assumption is that if MovieLensUserVector inherits from multiple parents in different modules that share the same class name, both parents will be treated as the same class.
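
A minimal sketch of this name-based check, under the assumptions above (the helper name `find_vector_class` is illustrative, not the actual Hermes API):

```python
import inspect

def find_vector_class(module, parent_name, dataname):
    """Find the class in `module` whose parent class is *named* `parent_name`.

    Base-class names are compared instead of calling issubclass() because
    hermes.modules.vectorgenerator.UserVector and
    modules.vectorgenerator.UserVector are distinct class objects to Python,
    even though they come from the same file.
    """
    for _, cls in inspect.getmembers(module, inspect.isclass):
        # Known weakness: two unrelated parents that happen to share the
        # name `parent_name` would both match this check.
        if any(base.__name__ == parent_name for base in cls.__bases__):
            if cls.__name__.lower().startswith(dataname.lower()):
                return cls
    return None

# Hypothetical usage: given a loaded MovieLens vector module,
# find_vector_class(movielens_module, "UserVector", "movielens")
# would return MovieLensUserVector.
```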


## Assumptions on Directory Creation

We assume that there is only one directory each with the label "vg", "rg", and "mg". These directories store the modules for vector, recommender, and metric creation specific to particular datasets or use cases. This assumption is made in the helper function load_modules_in_zip(), which checks whether the base directory of a file path is "vg", "rg", or "mg" in order to load the corresponding modules in the notebook during vector, recommender, or metric creation.
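
A minimal sketch of this check, assuming the zip layout described above (the helper name `modules_in_zip_by_kind` is illustrative; the actual load_modules_in_zip() may differ):

```python
import posixpath
import zipfile

def modules_in_zip_by_kind(zip_path, kind):
    """Yield .py members of the zip whose immediate parent directory is
    `kind`: "vg" (vectors), "rg" (recommenders), or "mg" (metrics)."""
    assert kind in ("vg", "rg", "mg")
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Zip member names always use forward slashes, so posixpath is
            # the right tool. Only the base directory of the path is checked,
            # which is why there must be exactly one "vg", "rg", and "mg"
            # directory in the archive.
            if name.endswith(".py") and posixpath.basename(posixpath.dirname(name)) == kind:
                yield name

# Hypothetical usage:
# for module_path in modules_in_zip_by_kind("hermes.zip", "vg"):
#     print(module_path)
```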