49 changes: 34 additions & 15 deletions CONTRIBUTING.md
@@ -1,36 +1,57 @@
# Contributing guide

## Ruby toolchain

This fork uses MRI Ruby `4.0.1` for development and CI.

To install dependencies, execute the following commands:
```bash
mise trust
mise use ruby@4.0.1
bundle install
```

## Installing optional development dependencies

`nmatrix` and `rb-gsl` are optional acceleration backends. They are not required
for the default test suite.

Some integration suites depend on external services and native/system packages:

- SQL and ActiveRecord integration specs require a compatible sqlite stack.
- DBI integration specs require DBI + sqlite adapter compatibility.
- Rserve integration specs require an available Rserve daemon.
- Gruff specs require ImageMagick/rmagick dependencies.

Example Linux setup for the optional stacks:

```bash
sudo apt-get update -qq
sudo apt-get install -y libgsl0-dev r-base r-base-dev
sudo Rscript -e "install.packages(c('Rserve','irr'),,'http://cran.us.r-project.org')"
sudo apt-get install libmagickwand-dev imagemagick
export DARU_TEST_NMATRIX=1 # for running nmatrix tests.
export DARU_TEST_GSL=1 # for running rb-GSL tests.
bundle install
```
You don't need `DARU_TEST_NMATRIX` or `DARU_TEST_GSL` unless you are changing those
parts of the code. Both are set in CI, so a regression in either backend will still
fail the build.

Run the default suite:

`bundle exec rspec`

Run optional suites explicitly:

```bash
DARU_TEST_SQL=1 bundle exec rspec --tag sql
DARU_TEST_DBI=1 bundle exec rspec --tag dbi
DARU_TEST_RSERVE=1 bundle exec rspec --tag rserve
DARU_TEST_NMATRIX=1 bundle exec rspec --tag nmatrix
DARU_TEST_GSL=1 bundle exec rspec --tag gsl
DARU_TEST_GRUFF=1 bundle exec rspec --tag gruff
```

If you have problems installing nmatrix, please consult the [nmatrix installation wiki](https://github.com/SciRuby/nmatrix/wiki/Installation) or the [mailing list](https://groups.google.com/forum/#!forum/sciruby-dev).


While preparing your pull requests, don't forget to check your code with Rubocop:

`bundle exec rubocop`



## Basic Development Flow
@@ -41,8 +62,6 @@ While preparing your pull requests, don't forget to check your code with Rubocop
4. Run the test suite with `rake spec`. (Alternatively, you can use `guard` as described [here](https://github.com/SciRuby/daru/blob/master/CONTRIBUTING.md#testing).) Also check the coding style guidelines with `rake cop`.
5. Commit the changes with `git commit -am "briefly describe what you did"` and submit pull request.



## Testing

16 changes: 16 additions & 0 deletions History.md
@@ -1,3 +1,19 @@
# Unreleased
* Major Enhancements
- Port development baseline to MRI Ruby 4.0.1.
- Add `mise.toml` toolchain configuration for reproducible local setup.
- Add runtime stdlib dependencies (`matrix`, `csv`) required on modern Ruby.
- Add missing development dependencies used by specs (`prime`, `mutex_m`, `benchmark`).
* Fixes
- Restore compatibility for CSV keyword arguments and URL reading via `URI.open`.
- Add `GroupBy#[]` for scalar and tuple-style group access.
- Fix `DataFrame` and `Vector` behavior regressions around mixed indexes and row/vector mutation.
- Add `DateTimeIndex.format` support for explicit parsing format.
- Improve SQL file source handling by supporting `sqlite3` connections directly.
* Testing
- Remove remaining pending examples from the default suite.
- Make optional integration suites (`sql`, `dbi`, `rserve`, `gsl`, `nmatrix`, `gruff`) opt-in and capability-aware.

# 0.3 (30 May 2020)
* Major Enhancements
- Remove official support for Ruby < 2.5.1. Now we only test with 2.5.1 and 2.7.1. (@v0dro)
29 changes: 28 additions & 1 deletion README.md
@@ -11,7 +11,7 @@ daru (Data Analysis in RUby) is a library for storage, analysis, manipulation an

daru makes it easy and intuitive to process data predominantly through 2 data structures:
`Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby, it works with all Ruby implementations.
Current development and CI baseline in this fork is MRI 4.0.1.

## daru plugin gems

@@ -53,6 +53,33 @@ This gem extends support for many Import and Export methods of `Daru::DataFrame`
$ gem install daru
```

## Development Setup

This fork is tested on Ruby `4.0.1` and includes a `mise.toml` toolchain file.

```console
$ mise trust
$ mise use ruby@4.0.1
$ bundle install
$ bundle exec rspec
```

Optional integration specs are excluded by default and can be enabled explicitly:

```console
$ DARU_TEST_SQL=1 bundle exec rspec --tag sql
$ DARU_TEST_DBI=1 bundle exec rspec --tag dbi
$ DARU_TEST_RSERVE=1 bundle exec rspec --tag rserve
```

Optional native backends are also opt-in:

```console
$ DARU_TEST_GSL=1 bundle exec rspec --tag gsl
$ DARU_TEST_NMATRIX=1 bundle exec rspec --tag nmatrix
$ DARU_TEST_GRUFF=1 bundle exec rspec --tag gruff
```

## Notebooks

#### Notebooks on most use cases
11 changes: 8 additions & 3 deletions daru.gemspec
@@ -29,6 +29,8 @@ Gem::Specification.new do |spec|

# it is required by NMatrix, yet we want to specify clearly which minimal version is OK
spec.add_runtime_dependency 'packable', '~> 1.3.13'
spec.add_runtime_dependency 'matrix'
spec.add_runtime_dependency 'csv'

spec.add_development_dependency 'spreadsheet', '~> 1.1.1'
spec.add_development_dependency 'bundler', '>= 1.10'
@@ -42,18 +44,21 @@ Gem::Specification.new do |spec|
spec.add_development_dependency 'nyaplot', '~> 0.1.5'
spec.add_development_dependency 'nmatrix', '~> 0.2.1' if ENV['DARU_TEST_NMATRIX']
spec.add_development_dependency 'distribution', '~> 0.7'
spec.add_development_dependency 'prime'
spec.add_development_dependency 'gsl', '~>2.1.0.2' if ENV['DARU_TEST_GSL']
spec.add_development_dependency 'dbd-sqlite3'
spec.add_development_dependency 'dbi'
spec.add_development_dependency 'activerecord', '~> 6.0'
spec.add_development_dependency 'mutex_m'
spec.add_development_dependency 'benchmark'
spec.add_development_dependency 'mechanize'
# issue: https://github.com/SciRuby/daru/issues/493 occurred
# with latest version of sqlite3
spec.add_development_dependency 'sqlite3'
spec.add_development_dependency 'rubocop', '~> 0.49.0'
spec.add_development_dependency 'ruby-prof'
spec.add_development_dependency 'simplecov'
# Gruff pulls native ImageMagick dependencies through rmagick.
# Keep it opt-in for environments that explicitly test plotting via Gruff.
spec.add_development_dependency 'gruff' if ENV['DARU_TEST_GRUFF']
spec.add_development_dependency 'webmock'

spec.add_development_dependency 'nokogiri'
9 changes: 9 additions & 0 deletions lib/daru/core/group_by.rb
@@ -273,6 +273,15 @@ def get_group group
)
end

# Returns a group as a DataFrame. Accepts scalar keys for single-level
# groups and tuple-like keys for multi-level groups.
def [](*group)
group = group.first if group.size == 1 && group.first.is_a?(Array)
group = [group] unless group.is_a?(Array)

get_group(group)
end
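The argument normalization at the top of `#[]` can be illustrated standalone. `normalize_group` below is a hypothetical helper, not part of daru, that mirrors those two lines:

```ruby
# Hypothetical sketch of the key normalization performed by GroupBy#[].
def normalize_group(*group)
  group = group.first if group.size == 1 && group.first.is_a?(Array)
  group = [group] unless group.is_a?(Array)
  group
end

normalize_group(:a)        # scalar key for a single-level group => [:a]
normalize_group(:a, :b)    # splatted tuple key                  => [:a, :b]
normalize_group([:a, :b])  # array-style tuple key               => [:a, :b]
```

All three call styles collapse to the array form that `get_group` expects.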

# Iteratively applies a function to the values in a group and accumulates the result.
# @param init (nil) The initial value of the accumulator.
# @yieldparam block [Proc] A proc or lambda that accepts two arguments. The first argument
18 changes: 14 additions & 4 deletions lib/daru/dataframe.rb
@@ -2468,6 +2468,10 @@ def aggregate(options={}, multi_index_level=-1)
end

def group_by_and_aggregate(*group_by_keys, **aggregation_map)
if aggregation_map.empty? && group_by_keys.last.is_a?(Hash)
aggregation_map = group_by_keys.pop
end

group_by(*group_by_keys).aggregate(aggregation_map)
end
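The `pop` exists because on Ruby 3 a braced hash literal is a positional argument, so it lands in `*group_by_keys` instead of `**aggregation_map`. A minimal standalone sketch (`split_group_args` is a hypothetical stand-in that returns both parts for inspection):

```ruby
# Hypothetical sketch of the trailing-hash normalization above.
def split_group_args(*group_by_keys, **aggregation_map)
  if aggregation_map.empty? && group_by_keys.last.is_a?(Hash)
    aggregation_map = group_by_keys.pop
  end
  [group_by_keys, aggregation_map]
end

# A braced hash is positional on Ruby 3 and must be popped back out:
split_group_args(:city, {age: :mean})  # => [[:city], {age: :mean}]
# Bare key-value pairs are captured as keywords directly:
split_group_args(:city, age: :mean)    # => [[:city], {age: :mean}]
```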

@@ -2863,9 +2867,12 @@ def deduce_index index, source, vectors_have_same_index
elsif vectors_have_same_index
source.values[0].index.dup
else
all_indexes = source.values.flat_map { |v| v.index.to_a }.uniq
begin
all_indexes = all_indexes.sort
rescue ArgumentError
# Mixed / non-comparable index types: preserve insertion order.
end

Daru::Index.new all_indexes
end
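The rescue is needed because mixed index types are not mutually comparable in Ruby, so `sort` raises rather than returning a partial order. A minimal standalone illustration with assumed sample values:

```ruby
# Symbols and Integers cannot be compared with <=>, so sorting a mixed
# index raises ArgumentError and the insertion order is preserved instead.
indexes = [:b, 2, :a, 1].uniq
begin
  indexes = indexes.sort
rescue ArgumentError
  # keep insertion order
end
indexes  # => [:b, 2, :a, 1]
```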
@@ -3055,7 +3062,10 @@ def coerce_vector vector

def update_data source, vectors
@data = @vectors.each_with_index.map do |_vec, idx|
vec_source = source[idx]
vec_source = vec_source.dup if vec_source.respond_to?(:dup)

Daru::Vector.new(vec_source, index: @index, name: vectors[idx])
end
end
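The defensive `dup` prevents two vectors from wrapping the same underlying array, where a mutation through one would silently leak into the other. The aliasing it guards against, in plain Ruby:

```ruby
# Without a copy, both names point at one array object and mutation leaks:
shared = [1, 2, 3]
aliased = shared
aliased << 4
shared  # => [1, 2, 3, 4]

# With a defensive dup, mutation stays local to the copy:
fresh = [1, 2, 3]
copy = fresh.dup
copy << 4
fresh   # => [1, 2, 3]
```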

11 changes: 10 additions & 1 deletion lib/daru/date_time/index.rb
@@ -124,7 +124,12 @@ def date_time_from date_string, date_precision
date_string.match(/\-\d?\d/).to_s.delete('-').to_i
)
else
# Keep backward-compatible configurable parsing when format is set.
if Daru::DateTimeIndex.format
DateTime.strptime(date_string, Daru::DateTimeIndex.format)
else
DateTime.parse(date_string)
end
end
end

@@ -215,6 +220,10 @@ class DateTimeIndex < Index
include Enumerable
Helper = DateTimeIndexHelper

class << self
attr_accessor :format
end
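The accessor lets callers pin an explicit parsing format (e.g. `Daru::DateTimeIndex.format = '%m/%d/%Y'`) instead of relying on `DateTime.parse`'s heuristics, which matters for ambiguous dates. A stdlib-only sketch of the difference (sample date is hypothetical):

```ruby
require 'date'

s = '02/03/2001'
# Heuristic parsing assumes day/month/year for slash-separated dates:
DateTime.parse(s).to_date.to_s                 # => "2001-03-02"
# An explicit strptime format removes the ambiguity:
DateTime.strptime(s, '%m/%d/%Y').to_date.to_s  # => "2001-02-03"
```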

def self.try_create(source)
if source && ArrayHelper.array_of?(source, ::DateTime)
new(source, freq: :infer)
56 changes: 48 additions & 8 deletions lib/daru/io/io.rb
@@ -1,4 +1,5 @@
module Daru
require 'open-uri'
require_relative 'csv/converters.rb'
module IOHelpers
class << self
@@ -16,6 +17,24 @@ def process_row(row,empty)
end
end

def process_fixed_width_row(line, ranges)
ranges.map do |range|
cell = line[range].to_s.strip
cell.empty? ? nil : try_string_to_number(cell)
end
end

def fixed_width_ranges(line, expected_columns=nil)
starts = line.to_enum(:scan, /\S+/).map { Regexp.last_match.begin(0) }
return [] if starts.empty?

starts = starts.first(expected_columns) if expected_columns
starts.each_with_index.map do |start_at, idx|
end_at = starts[idx + 1] || line.length
(start_at...end_at)
end
end
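Column boundaries are inferred from the whitespace layout of one sample line: each column starts where a token starts and ends where the next token starts. A self-contained restatement of the helper with a worked example (the data line is hypothetical):

```ruby
# Standalone sketch of the column-detection logic in fixed_width_ranges.
def fixed_width_ranges(line, expected_columns = nil)
  # Start offset of every whitespace-delimited token in the sample line.
  starts = line.to_enum(:scan, /\S+/).map { Regexp.last_match.begin(0) }
  return [] if starts.empty?

  starts = starts.first(expected_columns) if expected_columns
  starts.each_with_index.map do |start_at, idx|
    (start_at...(starts[idx + 1] || line.length))
  end
end

fixed_width_ranges('alice  30  NYC')  # => [0...7, 7...11, 11...14]
```

Each range then slices every subsequent line at the same offsets, which keeps empty cells aligned to their columns.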

private

INT_PATTERN = /^[-+]?\d+$/
@@ -103,7 +122,7 @@ def dataframe_write_csv dataframe, path, opts={}
converters: :numeric
}.merge(opts)

writer = ::CSV.open(path, 'w', **options)
writer << dataframe.vectors.to_a unless options[:headers] == false

dataframe.each_row do |row|
@@ -153,10 +172,21 @@ def from_activerecord(relation, *fields)

def from_plaintext filename, fields
ds = Daru::DataFrame.new({}, order: fields)
lines = File.readlines(filename)
first_data_line = lines.find { |line| !line.strip.empty? && line.strip != "\x1A" }
ranges = Daru::IOHelpers.fixed_width_ranges(first_data_line.to_s, fields.size)

lines.each do |line|
next if line.strip == "\x1A"

row =
if ranges.size == fields.size && !ranges.empty?
Daru::IOHelpers.process_fixed_width_row(line, ranges)
else
Daru::IOHelpers.process_row(line.strip.split(/\s+/), [''])
end

row.concat([nil] * (fields.size - row.size)) if row.size < fields.size
ds.add_row(row)
end
ds.update
@@ -182,7 +212,7 @@ def load filename
end

def from_html path, opts
optional_gem 'mechanize', '>=2.7.5'
page = Mechanize.new.get(path)
page.search('table').map { |table| html_parse_table table }
.keep_if { |table| html_search table, opts[:match] }
@@ -231,22 +261,32 @@ def from_csv_prepare_converters(converters)
def from_csv_hash_with_headers(path, opts)
opts[:header_converters] ||= :symbol
::CSV
.parse(read_csv_source(path), **opts)
.tap { |c| yield c if block_given? }
.by_col.map { |col_name, values| [col_name, values] }.to_h
end

def from_csv_hash(path, opts)
csv_as_arrays =
::CSV
.parse(read_csv_source(path), **opts)
.tap { |c| yield c if block_given? }
.to_a
headers = ArrayHelper.recode_repeated(csv_as_arrays.shift)
csv_as_arrays = csv_as_arrays.transpose
headers.each_with_index.map { |h, i| [h, csv_as_arrays[i]] }.to_h
end

def read_csv_source(path)
path = path.to_s

if path.match?(%r{\Ahttps?://}i)
URI.open(path, &:read)
else
File.read(path)
end
end
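The dispatch reduces to a case-insensitive scheme check, needed because on modern Ruby `Kernel#open` no longer opens URLs. A minimal sketch of just the predicate (paths are hypothetical):

```ruby
# Case-insensitive check for an http(s) URL, as used by read_csv_source.
def url?(path)
  path.to_s.match?(%r{\Ahttps?://}i)
end

url?('HTTPS://example.com/data.csv')    # => true
url?('spec/fixtures/local_data.csv')    # => false
```

URL sources go through `URI.open(path, &:read)`; everything else is read with `File.read`.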

def html_parse_table(table)
headers, headers_size = html_scrape_tag(table,'th')
data, size = html_scrape_tag(table, 'td')