Skip to content

Conversation

@ldecicco-USGS
Copy link
Collaborator

  1. Added note to function docs that properties NA will return all columns
  2. Check properties against data in sysdata.rda. If it seems a property doesn't exist, go check dynamically
  3. Make sure query errors on Bad Request.
  4. Make sure empty returns come back with the attributes (they were getting dropped).
  5. Cleanup some of the internal functions that were overly complex (due to not being refactored after other parts of the code were refactored)
  6. Moved some functions into their own files to fine them easier
  7. Added a no_page option for quicker, but sketchier downloads (not recommended unless you promise not to get mad at us if it doesn't come back with all the data)

@ldecicco-USGS
Copy link
Collaborator Author

I don't know why GH Actions are having so much problems with the setup-r-dependencies. Sometimes that seems to just happen and it's fixed later. For now, you can verify that things are working on code.usgs.gov (the latest pipeline just passed the package checks).

Copy link
Collaborator

@ehinman ehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do some testing, and I've included a few thoughts/tiny edits so far.

After running these tests:

test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500", properties = "dogs")
test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500", properties = c("value", "time", "dogs"))

I get the nice error message:

Error in (function (service, properties = NA_character_, bbox = NA, limit = NA,  : 
  Invalid properties requested.

A couple thoughts:

  • I don't see a request sent out to grab the properties schema. Should I?
  • It doesn't tell me which properties are invalid/valid. How tricky would it be to add that?

More later, nice work!!

#' Switch properties id
#'
#' If a user asks for either "id" or "output_id", it is only included in the
#' properties if that's the only column requested. "id" will always come back,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does id always come back? Seems like you could leave it out of non-NA properties and it wouldn't come back, right? That's how drpy works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't. It only comes back when it's the ONLY property requested(because if you left it out there, you'd get all of the properties), or the only property + geometry (because geometry will come no matter what unless you add skipGeometry to the url):

properties <- c("id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties, 
                              id = "monitoring_location_id")
[1] "state_name"   "country_name"
                               
properties2 <- c("monitoring_location_id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties2, 
                              id = "monitoring_location_id")
[1] "state_name"   "country_name"
                               
properties3 <- c("monitoring_locations_id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties3, 
                              id = "monitoring_location_id")
[1] "monitoring_locations_id" "state_name"              "country_name"           
                               
properties4 <- c("monitoring_location_id")
dataRetrieval:::switch_properties_id(properties4, 
                              id = "monitoring_location_id")
[1] "id"
                               
properties5 <- c("monitoring_location_id", "geometry")
dataRetrieval:::switch_properties_id(properties5, 
                              id = "monitoring_location_id")
[1] "id"

time = NA_character_,
bbox = NA,
limit = NA,
max_results = NA,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does max_results make sense with the no_paging argument being added? It seems like they do similar things...though max_results lets you do LESS than 50k rows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on this...you could set it up this way. Since I just added the sf integration to the csv it might be OK.

Did you guys end up not doing a max_results in drpy?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to make a new PR for removing max_results after I merge this one.

#' Convert columns if needed
#'
#' These are columns that have caused problems in testing.
#' Mostly if the columns are empty on 1 page, but not the next.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Do you have a good real life example?

#' @param convertType logical, defaults to `TRUE`. If `TRUE`, the function
#' will convert the data to dates and qualifier to string vector, and sepcifically
#' order the returning data frame by time and monitoring_location_id.
#' @param no_paging logical, defaults to `FALSE`. If `TRUE`, the data will
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also want to say something about how this option does not return geometry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does return an x, y however. I'll add some text

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just added some code to convert the x, y to geometry - so should be equal now to the json option (or...closer to equal at least).

#' @param properties A vector of requested columns to be returned from the query.
#' Available options are:
#' `r schema <- check_OGC_requests(endpoint = "continuous", type = "schema"); paste(names(schema$properties)[!names(schema$properties) %in% c("id", "internal_id")], collapse = ", ")`
#' `r dataRetrieval:::get_properties_for_docs("continuous", "continuous_id")`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, ?read_waterdata_continuous is not showing up for me. Is this an issue on my end?

@ehinman
Copy link
Collaborator

ehinman commented Dec 23, 2025

For some reason, I am getting an error with this line:

test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500", parameter_code = "00065", time = "2024-01-01/..")

Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/continuous/items?f=json&lang=en-US&skipGeometry=TRUE&limit=50000&monitoring_location_id=USGS-11348500&parameter_code=00065&time=2024-01-01%2F..
Error in walk_pages(req, max_results) : HTTP 500 Internal Server Error.

At first I was confused about why skipGeometry=TRUE, but now I remember that that's by design. Is this enumerated somewhere?

@ldecicco-USGS
Copy link
Collaborator Author

For some reason, I am getting an error with this line:

test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500", parameter_code = "00065", time = "2024-01-01/..")

Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/continuous/items?f=json&lang=en-US&skipGeometry=TRUE&limit=50000&monitoring_location_id=USGS-11348500&parameter_code=00065&time=2024-01-01%2F..
Error in walk_pages(req, max_results) : HTTP 500 Internal Server Error.

At first I was confused about why skipGeometry=TRUE, but now I remember that that's by design. Is this enumerated somewhere?

Good catch. Playing around it looks like the continuous data needs the time specified. If you use the vector approach (time = c(start, stop)) it adds that info automatically. We can add an issue to try to convert the time if someone enters a character string, but I kind of think most people will use the "vector" approach (since that's how the examples are shown):

# works:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500", 
                                  parameter_code = "00065", 
                                  time = c("2024-01-01", NA))
# works:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500", 
                                  parameter_code = "00065", 
                                  time = "2024-01-01T00:00:00Z/..")

# daily works with just a date:
test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500", 
                                  parameter_code = "00060", 
                                  time = "2024-01-01/..")
# Doesn't work:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500", 
                             parameter_code = "00065", 
                             time = "2024-01-01/..")

Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
@ldecicco-USGS
Copy link
Collaborator Author

I merged some comment about returning 49k rows, not sure why you were seeing that, but you can get 50k rows when using the no_paging:

```r
> multi_site <- read_waterdata_daily(monitoring_location_id =  c("USGS-01491000",
+                                                                "USGS-01645000"),
+                                    parameter_code = c("00060", "00010"), 
+                                    no_paging = TRUE)
Setting no_paging to TRUE will only return the first 50000 rows of data with no indication of missing data
Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=csv&lang=en-US&limit=50000
Warning message:
In get_csv(req, max_results) :
  Missing data is probable. Use no_paging = FALSE to 
assure all requested data is returned.

> nrow(multi_site)
[1] 50000

@ehinman
Copy link
Collaborator

ehinman commented Dec 23, 2025

I merged some comment about returning 49k rows, not sure why you were seeing that, but you can get 50k rows when using the no_paging:

```r
> multi_site <- read_waterdata_daily(monitoring_location_id =  c("USGS-01491000",
+                                                                "USGS-01645000"),
+                                    parameter_code = c("00060", "00010"), 
+                                    no_paging = TRUE)
Setting no_paging to TRUE will only return the first 50000 rows of data with no indication of missing data
Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=csv&lang=en-US&limit=50000
Warning message:
In get_csv(req, max_results) :
  Missing data is probable. Use no_paging = FALSE to 
assure all requested data is returned.

> nrow(multi_site)
[1] 50000

Oh, the comment I made was that this message says "....will only return the first 50000 rows..." but technically it returns up to 50000 rows, and may return less if there are actually less than 50000 rows of data. I was just anticipating a scenario where someone might be stumped that the message says 50k but their call returns less than 50k.

@ldecicco-USGS ldecicco-USGS merged commit e397bbc into DOI-USGS:develop Dec 29, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants