-
Notifications
You must be signed in to change notification settings - Fork 94
Csv options #834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Csv options #834
Conversation
ldecicco-USGS
commented
Dec 19, 2025
- Added note to function docs that properties NA will return all columns
- Check properties against data in sysdata.rda. If it seems a property doesn't exist, go check dynamically
- Make sure query errors on Bad Request.
- Make sure empty returns come back with the attributes (they were getting dropped).
- Cleanup some of the internal functions that were overly complex (due to not being refactored after other parts of the code were refactored)
- Moved some functions into their own files to fine them easier
- Added a no_page option for quicker, but sketchier downloads (not recommended unless you promise not to get mad at us if it doesn't come back with all the data)
Adding continuous data See merge request water/dataRetrieval!455
|
I don't know why GH Actions are having so much problems with the setup-r-dependencies. Sometimes that seems to just happen and it's fixed later. For now, you can verify that things are working on code.usgs.gov (the latest pipeline just passed the package checks). |
ehinman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to do some testing, and I've included a few thoughts/tiny edits so far.
After running these tests:
test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500", properties = "dogs")
test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500", properties = c("value", "time", "dogs"))
I get the nice error message:
Error in (function (service, properties = NA_character_, bbox = NA, limit = NA, :
Invalid properties requested.
A couple thoughts:
- I don't see a request sent out to grab the properties schema. Should I?
- It doesn't tell me which properties are invalid/valid. How tricky would it be to add that?
More later, nice work!!
| #' Switch properties id | ||
| #' | ||
| #' If a user asks for either "id" or "output_id", it is only included in the | ||
| #' properties if that's the only column requested. "id" will always come back, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does id always come back? Seems like you could leave it out of non-NA properties and it wouldn't come back, right? That's how drpy works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't. It only comes back when it's the ONLY property requested(because if you left it out there, you'd get all of the properties), or the only property + geometry (because geometry will come no matter what unless you add skipGeometry to the url):
properties <- c("id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties,
id = "monitoring_location_id")
[1] "state_name" "country_name"
properties2 <- c("monitoring_location_id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties2,
id = "monitoring_location_id")
[1] "state_name" "country_name"
properties3 <- c("monitoring_locations_id", "state_name", "country_name")
dataRetrieval:::switch_properties_id(properties3,
id = "monitoring_location_id")
[1] "monitoring_locations_id" "state_name" "country_name"
properties4 <- c("monitoring_location_id")
dataRetrieval:::switch_properties_id(properties4,
id = "monitoring_location_id")
[1] "id"
properties5 <- c("monitoring_location_id", "geometry")
dataRetrieval:::switch_properties_id(properties5,
id = "monitoring_location_id")
[1] "id"| time = NA_character_, | ||
| bbox = NA, | ||
| limit = NA, | ||
| max_results = NA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does max_results make sense with the no_paging argument being added? It seems like they do similar things...though max_results lets you do LESS than 50k rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn on this...you could set it up this way. Since I just added the sf integration to the csv it might be OK.
Did you guys end up not doing a max_results in drpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to make a new PR for removing max_results after I merge this one.
| #' Convert columns if needed | ||
| #' | ||
| #' These are columns that have caused problems in testing. | ||
| #' Mostly if the columns are empty on 1 page, but not the next. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: Do you have a good real life example?
| #' @param convertType logical, defaults to `TRUE`. If `TRUE`, the function | ||
| #' will convert the data to dates and qualifier to string vector, and sepcifically | ||
| #' order the returning data frame by time and monitoring_location_id. | ||
| #' @param no_paging logical, defaults to `FALSE`. If `TRUE`, the data will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also want to say something about how this option does not return geometry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does return an x, y however. I'll add some text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I just added some code to convert the x, y to geometry - so should be equal now to the json option (or...closer to equal at least).
| #' @param properties A vector of requested columns to be returned from the query. | ||
| #' Available options are: | ||
| #' `r schema <- check_OGC_requests(endpoint = "continuous", type = "schema"); paste(names(schema$properties)[!names(schema$properties) %in% c("id", "internal_id")], collapse = ", ")` | ||
| #' `r dataRetrieval:::get_properties_for_docs("continuous", "continuous_id")`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, ?read_waterdata_continuous is not showing up for me. Is this an issue on my end?
|
For some reason, I am getting an error with this line: At first I was confused about why |
Good catch. Playing around it looks like the continuous data needs the time specified. If you use the vector approach (time = c(start, stop)) it adds that info automatically. We can add an issue to try to convert the time if someone enters a character string, but I kind of think most people will use the "vector" approach (since that's how the examples are shown): # works:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500",
parameter_code = "00065",
time = c("2024-01-01", NA))
# works:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500",
parameter_code = "00065",
time = "2024-01-01T00:00:00Z/..")
# daily works with just a date:
test <- read_waterdata_daily(monitoring_location_id = "USGS-11348500",
parameter_code = "00060",
time = "2024-01-01/..")
# Doesn't work:
test <- read_waterdata_continuous(monitoring_location_id = "USGS-11348500",
parameter_code = "00065",
time = "2024-01-01/..") |
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
Co-authored-by: Elise Hinman <121896266+ehinman@users.noreply.github.com>
|
I merged some comment about returning 49k rows, not sure why you were seeing that, but you can get 50k rows when using the no_paging: ```r
> multi_site <- read_waterdata_daily(monitoring_location_id = c("USGS-01491000",
+ "USGS-01645000"),
+ parameter_code = c("00060", "00010"),
+ no_paging = TRUE)
Setting no_paging to TRUE will only return the first 50000 rows of data with no indication of missing data
Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=csv&lang=en-US&limit=50000
Warning message:
In get_csv(req, max_results) :
Missing data is probable. Use no_paging = FALSE to
assure all requested data is returned.
> nrow(multi_site)
[1] 50000 |
Oh, the comment I made was that this message says "....will only return the first 50000 rows..." but technically it returns up to 50000 rows, and may return less if there are actually less than 50000 rows of data. I was just anticipating a scenario where someone might be stumped that the message says 50k but their call returns less than 50k. |