Merged
5 changes: 2 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: geocodebr
Title: Geolocalização De Endereços Brasileiros (Geocoding Brazilian Addresses)
Version: 0.6.2
Version: 0.6.2.9000
Authors@R: c(
person("Rafael H. M.", "Pereira", , "rafa.pereira.br@gmail.com", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2125-7465")),
@@ -42,7 +42,6 @@ Imports:
duckspatial (>= 1.0.0),
enderecobr (>= 0.5.0),
fs,
geoarrow (>= 0.4.2),
glue,
h3r,
httr2 (>= 1.0.0),
@@ -68,4 +67,4 @@ Config/testthat/edition: 3
Encoding: UTF-8
Language: pt
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.3
Config/roxygen2/version: 8.0.0
19 changes: 18 additions & 1 deletion NEWS.md
@@ -1,3 +1,20 @@
# geocodebr v0.6.3 dev

## Correção de bugs (Bug fixes)

- Fixed a bug so that users can now pass address tables containing only a
subset of the address fields as input. The municipality and state fields
remain mandatory. Closes [#89](https://github.com/ipeaGIT/geocodebr/issues/89)
and [#94](https://github.com/ipeaGIT/geocodebr/issues/94)

## Mudanças pequenas (Minor changes)

- The `geocode_reverso()` function is slightly faster and uses substantially
less memory. In a sample of 1,000 points, memory use dropped
from 161MB to 95MB.



# geocodebr v0.6.2

## Correção de bugs (Bug fixes)
@@ -7,7 +24,7 @@ current data release, and ignores any data from old releases that may be
in the folder. [Closes #90](https://github.com/ipeaGIT/geocodebr/issues/90)
- The `geocode()` function now returns an informative error when a column in the
input table has a name containing a non-alphanumeric character, such as . , ? ^ - ! ~. There
is no problem with the underscore ("sublinhado") _, as in “name_muni”. Closes [issue #92](https://github.com/ipeaGIT/geocodebr/issues/92)
is no problem with the underscore ("barra baixa") _, as in “name_muni”. Closes [issue #92](https://github.com/ipeaGIT/geocodebr/issues/92)
- Fixed a bug in the `geocode_reverso()` function that prevented the use of very
high values of `dist_max`. [Closes #88](https://github.com/ipeaGIT/geocodebr/issues/88)
- Added 'Language: pt' to DESCRIPTION
49 changes: 44 additions & 5 deletions R/geocode.R
@@ -210,6 +210,30 @@ geocode_core <- function(
# systime start 66666 ----------------
# timer$mark("Start")

# handle any missing fields in input data -------------------------------------------------------
# geocodebr requires all address fields to be declared
# if one or more fields are missing, we add placeholder columns filled with NA

campos_endereco <- assert_and_assign_address_fields(
campos_endereco,
enderecos
)

# determine which columns are missing, if any
missing_cols <- campos_endereco[unlist(lapply(campos_endereco, is.null))]

if (length(missing_cols)>=1) {

# add placeholder NA columns for the missing fields
data.table::setDT(enderecos)
new_colnames <- paste0(names(missing_cols), "tempgeocodebr")
enderecos[, (new_colnames) := NA_character_ ]

# update address fields with fake columns
campos_endereco[sapply(campos_endereco, is.null)] <- as.list(new_colnames)
}


# normalize input data -------------------------------------------------------
# standardizing the addresses table to increase the chances of finding a match
# in the CNEFE data
@@ -219,11 +243,6 @@ geocode_core <- function(
message_standardizing_addresses()
}

campos_endereco <- assert_and_assign_address_fields(
campos_endereco,
enderecos
)

input_padrao <- enderecobr::padronizar_enderecos(
enderecos = enderecos,
campos_do_endereco = enderecobr::correspondencia_campos(
@@ -487,6 +506,21 @@ geocode_core <- function(
# drop geocodebr temp id column
output_df[, tempidgeocodebr := NULL]

# # precisao column as an ordered factor
# ordem_precisao <- c(
# "numero",
# "numero_aproximado",
# "logradouro",
# "cep",
# "localidade",
# "municipio"
# )
# output_df[, precisao := factor(
# precisao,
# levels = ordem_precisao,
# ordered = TRUE
# )]

# Disconnect from DuckDB when done
duckdb::dbDisconnect(con)

@@ -508,6 +542,11 @@ geocode_core <- function(
# timer$mark("Add H3")
}

# drop any placeholder columns added for missing fields
if (length(missing_cols)>=1) {
output_df[, (new_colnames) := NULL]
}

# remove data.table class
data.table::setindex(output_df, NULL)
data.table::setDF(output_df)
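
The placeholder-column logic added in this file can be illustrated outside the package. The sketch below is a minimal base R approximation, not the package's implementation: it uses a plain `data.frame` instead of `data.table`, and the input table and field mapping are hypothetical stand-ins. Only the `tempgeocodebr` suffix and the `campos_endereco` / `new_colnames` names are taken from the diff.

```r
# Hypothetical input: a table that only has CEP, municipality and state
enderecos <- data.frame(
  cep = c("70390-025", "20071-001"),
  municipio = c("Brasilia", "Rio de Janeiro"),
  uf = c("DF", "RJ"),
  stringsAsFactors = FALSE
)

# Hypothetical field mapping: NULL means the user did not declare that field
campos_endereco <- list(
  logradouro = NULL,
  numero     = NULL,
  cep        = "cep",
  localidade = NULL,
  municipio  = "municipio",
  estado     = "uf"
)

# flag the fields that were not mapped to any column
is_missing <- vapply(campos_endereco, is.null, logical(1))
new_colnames <- paste0(names(campos_endereco)[is_missing], "tempgeocodebr")

if (any(is_missing)) {
  # add placeholder columns filled with NA so every field has a column
  enderecos[new_colnames] <- NA_character_
  # point the missing fields at the placeholder columns
  campos_endereco[is_missing] <- as.list(new_colnames)
}

# ... standardization and matching would run here ...

# drop the placeholder columns before returning the result
output_df <- enderecos[, setdiff(names(enderecos), new_colnames), drop = FALSE]
```

Moving the field assertion ahead of the standardization step, as the diff does, means the placeholders exist before any downstream code reads the mapping.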
67 changes: 19 additions & 48 deletions R/geocode_reverso.R
@@ -67,9 +67,6 @@ geocode_reverso <- function(
)
}

# pontos <- sf::st_transform(pontos, 4674)


# prep input -------------------------------------------------------

# convert input points to data.frame
@@ -114,24 +111,8 @@ geocode_reverso <- function(
# restrict the search scope to the municipalities -------------------------------------------------------
# determine potential municipalities
munis <- system.file("extdata/munis_bbox_2022.parquet", package = "geocodebr") |>
arrow::open_dataset() |>
sf::st_as_sf()

# placeholder to use geoarrow because:
# Namespace in Imports field not imported from: 'geoarrow'
# All declared Imports should be used.
geoarrow::as_geoarrow_vctr("POINT (0 1)")

# munis_path <- system.file("extdata/munis_2022.parquet", package = "geocodebr")
#
# query_register_muni <- glue::glue(
# "CREATE OR REPLACE TEMP VIEW munis AS
# SELECT *,
# geometry::GEOMETRY AS geometry
# FROM read_parquet('{munis_path}');"
# )
#
# DBI::dbExecute(conn, query_register_muni)
duckspatial::ddbs_open_dataset()


potential_munis <- duckspatial::ddbs_join(
x = pontos,
@@ -185,22 +166,28 @@ geocode_reverso <- function(
# ST_Point(lon, lat)::GEOMETRY('EPSG:4674') AS geom


cnefe_utm_duck <- duckspatial::ddbs_transform(
# convert cnefe to UTM
cnefe_utm_duck <- duckspatial::ddbs_transform(
x = 'cnefe_tb',
y = 'EPSG:31983',conn = conn,
y = 'EPSG:31983',
conn = conn,
quiet = TRUE
)

# input to UTM
input_utm_duck <- duckspatial::ddbs_transform(
# convert points to UTM
input_utm_duck <- duckspatial::ddbs_transform(
x = pontos,
y = 'EPSG:31983',
conn = conn,
name = "pontos_utm",
overwrite = T,
quiet = TRUE
)

# buffers around input points
# buffer around input points
buff <- duckspatial::ddbs_buffer(
x = input_utm_duck,
x = "pontos_utm",
conn = conn,
distance = dist_max,
quiet = TRUE
)
@@ -210,30 +197,14 @@ geocode_reverso <- function(
result <- duckspatial::ddbs_join(
x = cnefe_utm_duck,
y = buff,
join = "within",
join = "intersects", # intersects within
conn = conn,
name = "join_result",
overwrite = T,
quiet = TRUE
)
)

# write to connection
duckspatial::ddbs_write_table(
conn = conn,
data = input_utm_duck,
name = "pontos_utm",
overwrite = T,
temp_view = T,
quiet = TRUE
)

duckspatial::ddbs_write_table(
conn = conn,
data = result,
name = "join_result",
overwrite = T,
temp_view = T,
quiet = TRUE
)

# Get column names from both tables
cols_a <- DBI::dbGetQuery(conn, "SELECT column_name FROM (DESCRIBE pontos_utm)")$column_name
cols_b <- DBI::dbGetQuery(conn, "SELECT column_name FROM (DESCRIBE join_result)")$column_name
@@ -256,7 +227,7 @@ geocode_reverso <- function(
ST_Distance(a.geometry, b.geometry) AS distancia_metros,
ROW_NUMBER() OVER (
PARTITION BY a.id
ORDER BY ST_Distance(a.geometry, b.geometry)
ORDER BY distancia_metros
) AS rn
FROM pontos_utm AS a
JOIN join_result AS b
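
The reverse-geocoding steps in this file (project to metres, keep candidates within `dist_max`, then rank candidates per point by distance) can be sketched in plain base R. This is only an illustration under stated assumptions: the coordinates and addresses are made up, planar Euclidean distance stands in for `ST_Distance` on projected geometries, and it assumes the surrounding query keeps the row with `rn = 1`, which is not visible in the hunk.

```r
# Hypothetical projected coordinates in metres (e.g. a UTM zone); values are made up
pontos <- data.frame(id = c(1, 2), px = c(0, 1000), py = c(0, 0))
cnefe <- data.frame(
  endereco = c("A", "B", "C", "D"),
  cx = c(30, 400, 1010, 5000),
  cy = c(40, 0, 0, 0),
  stringsAsFactors = FALSE
)
dist_max <- 500

# all point/address pairs (Cartesian product), with planar distance in metres
pares <- merge(pontos, cnefe, by = NULL)
pares$distancia_metros <- sqrt((pares$px - pares$cx)^2 + (pares$py - pares$cy)^2)

# keep candidates inside the search radius (the role of the buffer + join)
pares <- pares[pares$distancia_metros <= dist_max, ]

# sort by point id and distance, then keep the first row per id
# (the role of ROW_NUMBER() OVER (PARTITION BY id ORDER BY distance) with rn = 1)
pares <- pares[order(pares$id, pares$distancia_metros), ]
mais_proximo <- pares[!duplicated(pares$id), c("id", "endereco", "distancia_metros")]
```

The Cartesian product is only workable for toy data; the buffer-based spatial join in the diff exists precisely to avoid computing every pairwise distance.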
5 changes: 0 additions & 5 deletions R/onLoad.R

This file was deleted.

27 changes: 14 additions & 13 deletions cran-comments.md
@@ -1,21 +1,22 @@
## R CMD check results

── R CMD check results ───────────────────────────────────────────── geocodebr 0.6.2 ────
Duration: 2m 33.3s
── R CMD check results ───────────────────────────────────────────── geocodebr 0.6.3 ────
Duration: 2m 39s

0 errors ✔ | 0 warnings ✔ | 0 notes ✔

# geocodebr v0.6.2

## Correção de bugs (Bug fixes)
# geocodebr v0.6.3

- Fixed a bug to ensure that the package uses only cached data from the
current release and ignores any data from older releases that may be
in the folder. [Closes #90](https://github.com/ipeaGIT/geocodebr/issues/90)
- The `geocode()` function now returns an informational error when a column in the
input table has a name containing a non-alphanumeric character, such as . , ? ^ - ! ~. There
is no issue with the underscore _, as in “name_muni”. Closed [issue #92](https://github.com/ipeaGIT/geocodebr/issues/92)
- Fixed a bug in the `geocode_reverso()` function that prevented the use of very
high values for `dist_max`. [Closes #88](https://github.com/ipeaGIT/geocodebr/issues/88)
- Added ‘Language: pt’ to DESCRIPTION
## Bug fixes

- Fixed a bug that prevented users from passing address tables containing only a
subset of address fields as input. Municipality and state fields remain
mandatory. Closes [#89](https://github.com/ipeaGIT/geocodebr/issues/89)
and [#94](https://github.com/ipeaGIT/geocodebr/issues/94)

## Minor changes

- The `geocode_reverso()` function achieved a small speed improvement, along
with a substantial reduction in memory usage. In a sample of 1,000 points,
memory consumption dropped from 161MB to 95MB.
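
As an illustration of the bug fix described above, a call with only a subset of address fields might look like the sketch below. The input values are illustrative, and the `run_geocode` flag is a hypothetical guard added here because `geocode()` requires the package to be installed and downloads reference data on first use. The `definir_campos()` arguments mirror the benchmark script in this PR.

```r
# toy input with only CEP, municipality and state; values are illustrative
enderecos <- data.frame(
  id = 1:2,
  cep = c("70390-025", "20071-001"),
  municipio = c("Brasilia", "Rio de Janeiro"),
  uf = c("DF", "RJ"),
  stringsAsFactors = FALSE
)

# hypothetical guard: set to TRUE to actually run the geocoding step
run_geocode <- FALSE

if (run_geocode && requireNamespace("geocodebr", quietly = TRUE)) {
  # logradouro, numero and localidade are left undeclared on purpose
  campos <- geocodebr::definir_campos(
    cep = "cep",
    municipio = "municipio",
    estado = "uf"
  )
  resultado <- geocodebr::geocode(
    enderecos = enderecos,
    campos_endereco = campos
  )
  print(head(resultado))
}
```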
Binary file modified inst/extdata/large_sample.parquet
Binary file not shown.
Binary file modified inst/extdata/munis_bbox_2022.parquet
Binary file not shown.
2 changes: 1 addition & 1 deletion man/definir_pasta_cache.Rd

3 changes: 2 additions & 1 deletion man/geocodebr.Rd

13 changes: 9 additions & 4 deletions tests/tests_rafa/benchmark_20k.R
@@ -9,14 +9,16 @@ ncores <- 7


campos <- geocodebr::definir_campos(
logradouro = 'logradouro',
numero = 'numero',
# logradouro = 'logradouro',
# numero = 'numero',
cep = 'cep',
localidade = 'bairro',
municipio = 'municipio',
estado = 'uf'
)

input_df$logradouro <- NULL
input_df$numero <- NULL

bench::mark(iterations = 3,
a <- geocodebr::geocode(
@@ -34,5 +36,8 @@ bench::mark(iterations = 3,
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory
# streetmap 0.6.0 dev 7.10s 7.26s 0.136 5.47MB 0 5 0 36.7s <df> <Rprofmem>
# laptop 0.6.0 CRAN 5.2s 5.53s 0.174 7.46MB 0 5 0 28.8s <df>
# load 1 a <- geoc… 8.1s 8.79s 0.116 3.03MB 0 3 0 26s
# sem 1 a <- geoc… 10.3s 10.5s 0.0944 3.03MB 0 3 0 31.8s
#1 "" 8.67s 8.86s 0.113 2.04MB 0.0565 2 1 17.7s <df>
#1 "" 8.35s 8.82s 0.115 5.43MB 0 3 0 26.1s <df>
#1 "NA_int" 6.52s 6.58s 0.152 4.18MB 0.0760 2 1 13.2s <df>
#1 "NA_int" 6.54s 6.81s 0.147 1.73MB 0.0734 2 1 13.6s <df>
# 1 a <- geocodebr::ge… 7.58s 7.72s 0.124 4.18MB 0 3 0 24.1s <df>
3 changes: 3 additions & 0 deletions tests/tests_rafa/generate_sample_data.R
@@ -130,6 +130,9 @@ setDT(df)
df[, id := 1:nrow(df)]
head(df)

data.table::setindex(df, NULL)
data.table::setDF(df)

arrow::write_parquet(df, './inst/extdata/large_sample.parquet')

