From 55c2f35eff03ede0ec0f682e2dd2e2f2b1c39433 Mon Sep 17 00:00:00 2001
From: Matthias Broecheler <matthias@datasqrl.com>
Date: Sun, 18 Aug 2024 14:35:33 -0700
Subject: [PATCH 1/2] refactoring documentation for 0.6 release

---
 docs/architectures/intro.md                   |   1 +
 docs/getting-started/concepts/datasqrl.md     | 109 ++---
 .../getting-started/concepts/when-datasqrl.md |   2 +
 docs/getting-started/concepts/why-datasqrl.md |  28 +-
 docs/getting-started/getting-started.md       |  32 --
 docs/getting-started/quickstart.md            | 379 ++++++++----------
 docs/getting-started/tutorials/overview.md    |  19 +-
 .../tutorials/recommendations/intro.md        | 232 +++++------
 docs/intro.md                                 | 117 +----
 docs/process/intro.md                         |  20 +-
 docs/reference/concepts/data-product.md       |  19 +-
 docs/reference/sqrl/learn.md                  |  52 ++-
 sidebars.js                                   |  12 +-
 13 files changed, 441 insertions(+), 581 deletions(-)
 create mode 100644 docs/architectures/intro.md

diff --git a/docs/architectures/intro.md b/docs/architectures/intro.md
new file mode 100644
index 00000000..2356fdf0
--- /dev/null
+++ b/docs/architectures/intro.md
@@ -0,0 +1 @@
+# Data Architectures
\ No newline at end of file
diff --git a/docs/getting-started/concepts/datasqrl.md b/docs/getting-started/concepts/datasqrl.md
index 1a3ae40a..93a0e708 100644
--- a/docs/getting-started/concepts/datasqrl.md
+++ b/docs/getting-started/concepts/datasqrl.md
@@ -4,91 +4,64 @@ title: "What is DataSQRL?"
 
 # What is DataSQRL?
 
-DataSQRL is an open-source compiler and build tool for implementing data products as data pipelines. A [data product](/docs/reference/concepts/data-product) processes, transforms, or analyzes data from one or multiple sources (user input, databases, data streams, API calls, file storage, etc.) and exposes the result as raw data, in a database, or through an API.
-DataSQRL eliminates most of the laborious code of implementing and stitching together multiple technologies into data pipelines. +DataSQRL is a flexible data development framework for building various types of streaming data architectures, like data pipelines, event-driven microservices, and Kappa. It provides the basic structure, common patterns, and a set of tools for streamlining the development process of [data products](/docs/reference/concepts/data-product). -Building a data product with DataSQRL takes 3 steps: +DataSQRL integrates any combination of the following technologies: +* **Apache Flink:** a distributed and stateful stream processing engine. +* **Apache Kafka:** a distributed streaming platform. +* **PostgreSQL:** a reliable open-source relational database system. +* **Apache Iceberg:** an open table format for large analytic datasets. +* **Snowflake:** a scalable cloud data warehousing platform. +* **RedPanda:** a Kafka-compatible streaming data platform. +* **Yugabyte:** a distributed open-source relational database. +* **Vert.x:** a reactive server framework for building data APIs. -1. **Implement SQL script:** You combine, transform, and analyze the input data using SQL. -2. **Expose Data (optional):** You define how to expose the transformed data in the API or database. -3. **Compile Data Pipeline:** DataSQRL compiles the SQL script and output specification into a fully integrated data pipeline. The compiled data pipeline ingests raw data, processes it according to the transformations and analyses defined in the SQL script, and serves the resulting data through the specified API or database. +You define the data processing in SQL (with support for custom functions in Java, Scala and soon Python) and DataSQRL generates the glue code, schemas, and mappings to automatically connect and configure these components into a coherent data architecture. DataSQRL also generates Docker Compose templates for local execution or deployment to Kubernetes or cloud-managed services. -In a nutshell, DataSQRL is an abstraction layer that takes care of the nitty-gritties of building efficient data pipelines and gives developers an easy-to-use tool to build data products. - -Follow the [quickstart tutorial](../../quickstart) to build a data product in a few minutes and see how DataSQRL works in practice. - -## How DataSQRL Works - -Compiled DataSQRL data pipeline > - -DataSQRL compiles the SQL script and output specification into a data pipeline that uses data technologies like [Apache Kafka](https://kafka.apache.org/), [Apache Flink](https://flink.apache.org/), or [Postgres](https://postgresql.org/). - -DataSQRL has a pluggable engine architecture which allows it to support various stream processors, databases, data warehouses, data streams, and API servers. Feel free to contribute your favorite data technology as a DataSQRL engine to the open-source, wink wink. - -DataSQRL can generate data pipelines with multiple topologies. Take a look at the [types of data products](/docs/reference/concepts/data-product#types) that DataSQRL can build. You can further customize those pipeline topologies in the DataSQRL [package configuration](/docs/reference/sqrl/datasqrl-spec/) which defines the data technologies at each stage of the resulting data pipeline. - -DataSQRL compiles executables for each engine in the pipeline which can be deployed on the data technologies and cloud services you already use. 
-In addition, DataSQRL provides development tooling that makes it easy to run and test data pipelines locally to speed up the development cycle.
+The architectures that DataSQRL supports
 
-## What DataSQRL Does
+DataSQRL supports multiple types of data architectures as shown above. Learn more about the [10 types of data architectures](../when-datasqrl) you can build with DataSQRL.
 
-Okay, you get the idea of a compiler that produces integrated data pipelines. But what exactly does DataSQRL do for you? Glad you asked.
+## DataSQRL Features
 
-DataSQRL Compilation >
+* 🔗 **System Integration:** Combine various data technologies into streamlined data architectures.
+* ☯️ **Declarative + Imperative:** Define the data flow in SQL and specific data transformations in Java, Scala, or soon Python.
+* 🧪 **Testing Framework:** Automated snapshot testing.
+* 🔄 **Data Flow Optimization:** Optimize data flow between systems through data mapping, partitioning, and indexing for scalability and performance.
+* ✔️ **Consistent:** Ensure at-least-once or exactly-once data processing for consistent results across the entire system.
+* 📦 **Dependency Management:** Manage data sources and sinks with versioning and a repository.
+* 📊 **GraphQL Schema Generator:** Expose processed data through a GraphQL API with subscription support for headless data services. (REST coming soon)
+* 🤖 **Integrated AI:** Support for the vector data type, vector embeddings, LLM invocation, and ML model inference.
+* { } **JSON Support:** Native JSON data type and JSON schema discovery.
+* 🔍 **Visualization Tools:** Inspect and debug data architectures visually.
+* 🪵 **Logging Framework:** Built-in logging for observability and debugging.
+* 🚀 **Deployment Profiles:** Automate the deployment of data architectures through configuration.
 
-To produce fully integrated data pipelines, the DataSQLR compiler:
-* resolves data imports to data source connectors and generates input schemas for the stream ingestion,
-* synchronizes data schemas and data management across all engines in the data pipeline,
-* aligns timestamps and watermarks across the engines,
-* orchestrates optimal data flow between engines,
-* translates the SQL script to the respective engine for execution,
-* and generates an API server that implements the given API specification.
 
-To produce high-performance data pipelines that respond to new input data in realtime and provide low latency, high throughput APIs to many concurrent users, DataSQRL optimizes the compiled data pipeline by:
-* partitioning the data flow and co-locating data where possible.
-* pruning the execution graph and consolidating repetitive computations.
-* determining when to pre-compute data transformations in the streaming engine to reduce response latencies versus computing result sets at request time in the database or server to avoid data staleness and combinatorial explosion in pre-computed results.
-* determining the optimal set of index structures to install in the database.
 
-In other words, DataSQRL can save you a lot of time and allows you to focus on what matters: implementing the logic and API of your data product.
 
-## Learn More
 
-- Read the [quickstart tutorial](../../quickstart) to get a feel for DataSQRL while building an entire data product in 10 minutes.
-- Find out [Why DataSQRL Exists](../why-datasqrl) and what benefits it provides.
-- [Compare DataSQRL](../../concepts/when-datasqrl) to other data technologies and see when to use it. 
-- Learn more about the [DataSQRL Optimizer](/docs/reference/sqrl/learn/#datasqrl-optimizer) and how the DataSQRL compiler generates efficient data pipelines.
-
-
\ No newline at end of file
diff --git a/docs/getting-started/concepts/when-datasqrl.md b/docs/getting-started/concepts/when-datasqrl.md
index d56dc2f2..a1044bb4 100644
--- a/docs/getting-started/concepts/when-datasqrl.md
+++ b/docs/getting-started/concepts/when-datasqrl.md
@@ -4,6 +4,8 @@ title: "When to use DataSQRL"
 
 # When Should I Use DataSQRL?
 
 DataSQRL is an intelligent compiler for data pipelines that eliminates data plumbing so you can build efficient data products faster, cheaper, and better.
 
diff --git a/docs/getting-started/concepts/why-datasqrl.md b/docs/getting-started/concepts/why-datasqrl.md
index 7a996f09..bc478208 100644
--- a/docs/getting-started/concepts/why-datasqrl.md
+++ b/docs/getting-started/concepts/why-datasqrl.md
@@ -5,7 +5,7 @@ title: "Why Use DataSQRL?"
 # Why Use DataSQRL?
 
 When you build data products, you end up wasting most of your time and effort on data plumbing. In fact, 80% ([source](#footnotes)) of data products fail to deliver value because of data plumbing issues.
-We developed the open-source DataSQRL compiler to eliminate data plumbing so you can build efficient data products in days instead of months.
+We developed the open-source DataSQRL data development framework to automate data plumbing so you can build efficient data products in days instead of months.
 
 DataSQRL allows you to build with data >
 
@@ -21,23 +21,23 @@ We are developing DataSQRL as a tool for developers to build data pipelines and
 
 ## What is Data Plumbing? {#dataplumbing}
 
-What's data plumbing? It's all that extra engineering you need to turn a data transformation into a deployable data pipeline. Specifically, there are 4 types of data plumbing that waste the most time, money, and effort.
+What's data plumbing? It's all the architecture work and glue code you need to implement to integrate multiple data technologies into a coherent data architecture. Specifically, there are five types of data plumbing that waste the most time, money, and effort.
 
 ### Code Fragmentation
 
 Data Pipeline Architecture
 
-A data pipeline consists of multiple technologies that work in concert to transform the raw input data into a valuable result. Data stream like Apache Kafka, stream processors like Apache Flink, databases like Postgres, and API servers like GraphQL.
+A data architecture consists of multiple technologies that work in concert to transform the raw input data into a valuable result. Streaming platforms like Apache Kafka, stream processors like Apache Flink, databases like Postgres, and API servers like GraphQL.
 
-To implement a coherent data pipeline, you need to split the logic of your data product across the various technologies that make up your data pipeline which leads to code fragmentation. And each technology uses a different language, dialect, and conceptual model which means you need to become an expert in each of the technologies or assemble a team of experts to implement a single data pipeline. 
+To implement a coherent data architecture, you need to split the logic of your data product across the various technologies that make up your architecture, which leads to code fragmentation. And each technology uses a different language, dialect, and conceptual model, which means you need to become an expert in each of the technologies or assemble a team of experts to implement a single data product. 
-That introduces a lot of coordination overhead, makes it hard to implement all the pipeline stages coherently, and very expensive to refactor a data product.
+That introduces a lot of coordination overhead, makes it hard to implement all the pipeline stages coherently, and makes it very expensive to evolve and maintain a data product.
 
 ### Data Flow Orchestration
 
-To make the data flow smoothly through the data pipeline, you have to implement the integration points between the various technologies in your data pipeline. That requires a lot of "glue code" that is hard to debug and maintain. In addition, you have to be very careful that data flows are synchronized in time to avoid inconsistencies.
+To make the data flow smoothly through the data architecture, you have to implement the integration points between the various technologies in your data architecture. That requires a lot of "glue code" that is hard to debug and maintain. In addition, you have to be very careful that data flows are synchronized in time to avoid inconsistencies.
 
-Furthermore, you end up writing a lot of configuration code to define how data is ingested and moved through the system. All of this code is specific to a particular data pipeline and needs to be maintained over time.
+Furthermore, you end up writing a lot of configuration code to define how data is ingested and moved through the system. All of this code is specific to a particular data architecture and needs to be maintained over time.
 
 ### Data Mapping
 
@@ -47,6 +47,10 @@ Each technology in the data pipeline has its own data and schema representation
 
 Each technology in the data pipeline has a different physical model and operational characteristics, which makes it difficult to optimize data pipelines for efficient operation. To optimize a data pipeline you need deep expertise in each of the technologies and an understanding of how their divergent operational behaviors play off each other to introduce inefficiencies.
 
+### Manual DevOps
+
+Running a data architecture in production requires a lot of manual DevOps work or custom automation.
+
 ## Benefits of DataSQRL
 
 If you are building a data product, DataSQRL can save you a lot of time, make your life easier, and produce better implementations by eliminating data plumbing.
 
@@ -57,10 +61,10 @@ Let's break that down:
 
 DataSQRL saves you time >
 
-DataSQRL's intelligent compiler eliminates data plumbing and saves you the time and effort required to tackle the four types of data plumbing outlined above. DataSQRL handles all the time-consuming details of data pipeline implementation for you. You implement the logic of your data product in SQL, and DataSQRL compiles that logic into an optimized data pipeline.
+DataSQRL's intelligent compiler eliminates data plumbing and saves you the time and effort required to tackle the five types of data plumbing outlined above. DataSQRL handles all the time-consuming details of data architecture implementation for you. You implement the logic of your data product in SQL, and DataSQRL compiles that logic into an optimized data architecture.
 
 DataSQRL gives you a higher level of abstraction, so you don't get bogged down implementing, integrating, and optimizing low level data abstractions.
-You don't write your software in low-level languages like [Assembly](https://en.wikipedia.org/wiki/Assembly_language). You use a higher level language like Javascript, Python, Java, etc that compile into machine code to make you more productive. DataSQRL is a compiler for your data pipeline to make you more productive.
+You don't write your software in low-level languages like [Assembly](https://en.wikipedia.org/wiki/Assembly_language). You use a higher-level language like JavaScript, Python, or Java that compiles into machine code to make you more productive. In the same way, DataSQRL is a compiler for your data architecture.
 
 ### DataSQRL is Easy to Use {#easy-to-use}
 
 DataSQRL gives you a higher level of abstraction for implementing data products.
 
 First, DataSQRL handles a lot of things for you that you don't have to worry about at all, like the data plumbing issues outlined above. When you implement a data product in DataSQRL you have to learn fewer concepts to be successful. DataSQRL doesn't hide any of these elements from you. You get full visibility and can control those elements if you like. But you don't have to, and in most cases you never have to worry about them.
 
-You can focus entirely on the logic of your data product by defining data transformations and analytics. DataSQRL uses those definitions to figure out what the schema should look like, how the data should flow, and how to retrieve it for API requests. This simplifies implementing a data product and saves you a ton of data plumbing code that holds a data pipeline together.
+You can focus entirely on the logic of your data product by defining data transformations and analytics. DataSQRL uses those definitions to figure out what the schema should look like, how the data should flow, and how to retrieve it for API requests. This simplifies implementing a data product and saves you a ton of data plumbing code that holds a data architecture together.
 
 DataSQRL is easy to use >
 
-Second, the DataSQRL compiler not only determines how to implement data operations but also *when*. A common tradeoff data pipeline implementations face is processing data at ingest time (i.e. when a new data record is ingested) versus at query time (i.e. when a user of the API issues a request). For example, suppose we are providing an API that shows customers the total amount of money they have spent at our e-commerce store. We can compute this value by summing over all the orders at query time or incrementally updating a sum at ingest time when a new order is placed. The result is the same but has different operational characteristics and can make the difference between things humming along and your database being brought to its knees.
-If you are thinking "why are you boring me with these data pipeline implementation trivia", that's exactly the point: With DataSQRL you don't have to think about this. It abstracts those tradeoffs away. If you are going the low-level route and assemble a data pipeline architecture yourself, you'll have to worry about these and other tradeoffs as you design the system. And that makes it very expensive to evolve your pipeline over time. +Second, the DataSQRL compiler not only determines how to implement data operations but also *when*. A common tradeoff data architecture implementations face is processing data at ingest time (i.e. when a new data record is ingested) versus at query time (i.e. when a user of the API issues a request). For example, suppose we are providing an API that shows customers the total amount of money they have spent at our e-commerce store. We can compute this value by summing over all the orders at query time or incrementally updating a sum at ingest time when a new order is placed. The result is the same but has different operational characteristics and can make the difference between things humming along and your database being brought to its knees.
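+As a minimal sketch of why this matters, the spending total from that example can be declared once in SQRL, and whether it is maintained incrementally at ingest time or computed at query time is then left to the compiler. The `Orders` table and its columns below are illustrative, not part of this page:
+
+```sql
+/* Hypothetical: total spend per customer, declared once.
+   The compiler decides whether to pre-compute this at ingest
+   time or to evaluate it at query time. */
+CustomerSpending := SELECT customerid, sum(total) AS total_spend
+                    FROM Orders GROUP BY customerid;
+```
+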
+If you are thinking "why are you boring me with this data architecture implementation trivia", that's exactly the point: With DataSQRL you don't have to think about this. It abstracts those tradeoffs away. If you are going the low-level route and assemble a data architecture yourself, you'll have to worry about these and other tradeoffs as you design the system. And that makes it very expensive to evolve your architecture over time.
 
 ### DataSQRL Compiles Fast & Efficient Pipelines {#performance}
 
diff --git a/docs/getting-started/getting-started.md b/docs/getting-started/getting-started.md
index 82a461d9..8d4c2685 100644
--- a/docs/getting-started/getting-started.md
+++ b/docs/getting-started/getting-started.md
@@ -5,38 +5,6 @@ import TabItem from '@theme/TabItem';
 
 Let's build a data pipeline with DataSQRL in just a few simple steps.
 
-## Installation
-
-You can install DataSQRL on your Mac with [HomeBrew](https://brew.sh/) or use [Docker](https://www.docker.com/products/docker-desktop/) on any machine.
-
-```bash
-brew tap datasqrl/sqrl
-brew install sqrl-cli
-```
-
-:::note
-Check that you're on the current version of DataSQRL by running `sqrl --version`
-To update an existing installation:
-
-```bash
-brew upgrade sqrl-cli
-```
-:::
-
-Pull the latest Docker image to ensure you have the most recent version of DataSQRL:
-
-```bash
-docker pull datasqrl/cmd:latest
-```
-
 ## Implement SQL Script
 
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 0849e517..4920b2ae 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -2,21 +2,54 @@
 title: "Metrics Processing"
 ---
 
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
 # DataSQRL Quickstart in 10 Minutes
 
 Metrics Monitoring Quickstart >|
 
 We are going to build a data pipeline that analyzes sensor metrics in 10 minutes. Tik tok, let's go!
 
+## Installation
+
+Install DataSQRL as a command on your machine by selecting the operating system below, or use the Docker version of DataSQRL on any machine.
+
+```bash
+brew tap datasqrl/sqrl
+brew install sqrl-cli
+```
+
+Pull the latest Docker image to ensure you have the most recent version of DataSQRL:
+
+```bash
+docker pull datasqrl/cmd:latest
+```
+
 ## Create Script
 
 First, we are going to define the metrics processing for our data product using SQL.
 
-:::info
-
-If you are unfamiliar with SQL, we recommend you read our [SQL Primer](/docs/reference/sqrl/sql-primer) first. 
-:::
 
 In the terminal or command line, create an empty folder for the SQL script:
 
@@ -27,45 +60,41 @@ In the terminal or command line, create an empty folder for the SQL script:
 Then create a new file called `metrics.sqrl` and copy-paste the following SQL code:
 
 ```sql title=metrics.sqrl
-IMPORT datasqrl.example.sensors.SensorReading; -- Import metrics
-IMPORT time.endOfSecond; -- Import time function
-/* Aggregate sensor readings to second */
-SecReading := SELECT sensorid, endOfSecond(time) as timeSec,
-                     avg(temperature) as temp
-              FROM SensorReading GROUP BY sensorid, timeSec;
-/* Get max temperature in last minute per sensor */
-SensorMaxTemp := SELECT sensorid, max(temp) as maxTemp
-                 FROM SecReading
-                 WHERE timeSec >= now() - INTERVAL 1 MINUTE
-                 GROUP BY sensorid;
+CREATE TABLE AddSensorReading (
+   sensorid INT NOT NULL,
+   temperature DOUBLE NOT NULL
+);
+/* convert temperature to Fahrenheit */
+SensorReading := SELECT *, temperature*1.8+32 AS temperatureF
+                 FROM AddSensorReading ORDER BY event_time DESC;
+/* Compute avg and max temp per minute time window */
+SensorMinuteTemp := SELECT sensorid, endOfMinute(event_time) as timeMin,
+                           avg(temperature) as avgTemp, max(temperature) as maxTemp
+                    FROM AddSensorReading GROUP BY sensorid, timeMin;
```
 
-DataSQRL's flavor of SQL is called "SQRL", which defines tables using the `:=` assignment operator and supports explicit data and function imports.
+DataSQRL's flavor of SQL is called "SQRL", which extends ANSI SQL with some convenient syntax like the `:=` assignment operator for defining derived tables.
 
-In the script, we import the sensor data we are processing and a time function we use for aggregation.
+In this script, we create the `AddSensorReading` table to collect sensor metrics.
+We define the `SensorReading` table, which augments the collected sensor metrics with the temperature in Fahrenheit.
+We define another table `SensorMinuteTemp`, which computes the average and maximum temperature per sensor for every one-minute time window.
+The `event_time` column is not declared in the `CREATE TABLE` statement; it is the timestamp that DataSQRL attaches to each record when the record is ingested through the API.
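+
+As a sketch of how further derived tables can be stacked on top of these (hypothetical and not part of this quickstart; it assumes an `endOfHour` time function analogous to the `endOfMinute` function used above), an hourly rollup could be defined in exactly the same way:
+
+```sql
+/* Hypothetical hourly rollup over the same readings.
+   Assumes endOfHour() is available alongside endOfMinute(). */
+SensorHourTemp := SELECT sensorid, endOfHour(event_time) AS timeHour,
+                         avg(temperature) AS avgTemp, max(temperature) AS maxTemp
+                  FROM AddSensorReading GROUP BY sensorid, timeHour;
+```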
 
-We define the `SecReading` table that aggregates all sensor metrics within one second to smooth our temperature readings.
-We define another table `SensorMaxTemp` which computes the maximum temperature in the last minute for each sensor.
-
-## Compile the Script {#run}
-
-DataSQRL compiles our SQRL script into an integrated data pipeline with the following command:
-
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
+## Run the Script {#run}
 
+DataSQRL compiles our SQRL script into an integrated data microservice that orchestrates Apache Kafka, Flink, Postgres, and a GraphQL server. Run it with the following command:
 
 ```bash
-sqrl compile metrics.sqrl
+sqrl run metrics.sqrl
 ```
 
 ```bash
 docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl
+(cd build/deploy; docker compose up --build -d)
 ```
 
 :::note
 If you are using PowerShell on Windows, you need to replace `$PWD` with `${PWD}`.
 :::
 
-## Run the Script {#run}
-```bash
-(cd build/deploy; docker compose up --build -d)
-```
-
 Once the pipeline is running, it will ingest, process, store, and serve the data through an API.
 
 :::note
 We'll start up postgres, flink, kafka, and a graphql server. You may have other services running that these could conflict with.
 :::
 
-## Query API {#query}
+## Access the API {#query}
 
-Open your favorite browser and navigate to [`http://localhost:8888/graphiql/`](http://localhost:8888/graphiql/) to access GraphiQL - a popular GraphQL IDE. Write GraphQL queries in the left-hand panel. For example, copy the following query:
+Open your browser and navigate to [`http://localhost:8888/graphiql/`](http://localhost:8888/graphiql/) to access GraphiQL - a popular GraphQL IDE - through which you can access the GraphQL API. Write GraphQL queries in the left-hand panel.
+
+### Insert Data with Mutations
+
+To record sensor metrics, copy the following mutation:
 
 ```graphql
-{
-  SensorMaxTemp(sensorid: 1) {
-    maxTemp
+mutation {
+  AddReading(metric: {
+    sensorid: 1,
+    temperature: 37.2
+  }) {
+    event_time
   }
 }
 ```
 
-When you hit the "run" button you get the maximum temperature for the sensor with id `1` in the last minute.
+Hit the "run" button to execute the mutation.
 
-And there you have it: a running data pipeline that ingests metrics, aggregates them, and exposes the results through a GraphQL API which you can call in your applications.
+### Query Data
 
-To stop the pipeline, interrupt it with `CTRL-C` and run `(cd build/deploy; docker compose down -v)`.
+Now we can query the data with:
 
-## Customize API
+```graphql
+{
+  SensorReading(sensorid: 1) {
+    temperature
+    temperatureF
+    event_time
+  }
+}
+```
 
-Got a little more time? Let's customize the GraphQL API and add a metrics ingestion endpoint.
+### Realtime Subscriptions
 
-By default, DataSQRL generates a GraphQL schema for us based on the tables we define in the SQRL script. That's great for rapid prototyping, but eventually we want to customize the API and limit data access.
+DataSQRL supports GraphQL subscriptions, so we can push processed data to the user in realtime instead of the user having to query for it. This is useful when we want to update dashboards with new metrics automatically and in realtime.
 
-To save us time, we are going to start with the GraphQL API that DataSQRL generates for us by running this command:
+We are going to subscribe to the `SensorReading` table and trigger a new metric through the mutation. To see this happening in realtime, open two browser windows and navigate to [`http://localhost:8888/graphiql/`](http://localhost:8888/graphiql/) in both so you can see them side-by-side.
 
-```bash
-sqrl compile metrics.sqrl --api graphql
-```
+On one, start the GraphQL subscription:
+```graphql
+subscription {
+  SensorReading(sensorid: 2) {
+    temperature
+    temperatureF
+    event_time
+  }
+}
+```
 
-```bash
-docker run --rm -v $PWD:/build datasqrl/cmd compile metrics.sqrl --api graphql
-```
+On the other, fire off a mutation:
+```graphql
+mutation {
+  AddReading(metric: {
+    sensorid: 2,
+    temperature: 52.8
+  }) {
+    event_time
+  }
+}
+```
 
-There is now a file called `schema.graphqls` in the same folder as our script. Open it and take a look.
+The data shows up in the subscription window almost immediately because it is pushed through the subscription.
 
-Notice, how each table defined in our SQRL script maps to a query endpoint in the API and an associated result type. The query endpoints accept arguments for each column of the table to filter the results by column values.
+### Consistent Data Processing
 
-We are going to remove most of those arguments to only support querying by `sensorid`. We will also remove the `SensorReading` query endpoint and result type to only expose the smoothed-out sensor readings from the `SecReading` table.
+Let's try querying the minute temperature aggregates:
 
-In the `schema.graphqls` file, remove the `SensorReading` type and replace the query definition with the following:
-
-```graphql title=metricsapi.graphqls
-type Query {
-    SecReading(sensorid: Int!): [SecReading!] 
-    SensorMaxTemp(sensorid: Int): [SensorMaxTemp!]
-}
-```
+```graphql
+{
+  SensorMinuteTemp(sensorid: 1) {
+    timeMin
+    avgTemp
+    maxTemp
+  }
+}
+```
 
-Note, that we made `sensorid` a required argument for the `SecReading` query endpoint.
+You get an empty result, even if you wait a minute for the time window to close. DataSQRL guarantees that the compiled data pipeline is consistent by default, which means that a time window does not close until additional data comes in to guarantee that all data has been received. This gives us consistency guarantees in case of delayed data or system outages.
+
+After waiting a minute, try adding more sensor metrics with the mutation above. If you run the `SensorMinuteTemp` query again, you'll see the result.
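+
+For example, one more reading like the following (the values are arbitrary) advances the stream so that the earlier minute window can close:
+
+```graphql
+mutation {
+  AddReading(metric: {
+    sensorid: 1,
+    temperature: 39.4
+  }) {
+    event_time
+  }
+}
+```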
+
+And there you have it: a running data pipeline that ingests metrics, aggregates them, and exposes the results through a GraphQL API which you can call in your applications.
 
-Now, invoke the compiler with the GraphQL schema we just created and launch the updated pipeline:
-
-```bash
-sqrl compile metrics.sqrl schema.graphqls
-```
 
+To stop the pipeline, interrupt it with `CTRL-C` and then take down the pipeline with:
 ```bash
-docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl schema.graphqls
+(cd build/deploy; docker compose down -v)
 ```
 
-Followed By:
-```bash
-(cd build/deploy; docker compose up --build -d)
-```
-
-When you refresh GraphiQL in the browser, you see that the API is simpler and only exposes the data for our use case.
+## Connecting Data Sources {#source}
 
-## Ingest Metrics with Mutations
+In addition to ingesting and querying data through the API, DataSQRL also supports importing and exporting data from and to external data sources like Kafka topics, files, databases, Iceberg tables, etc.
 
-So far, we have ingested metrics data from an external source imported from the [DataSQRL repository](http://dev.datasqrl.com). The data source is static which is convenient for whipping up an example data product, but we want our data pipeline to provide a metrics ingestion endpoint.
+The import below defines a table `SensorAssignment`, which contains a CDC stream of updates to a database table. We resolve this import against the DataSQRL Repository, where the connector configuration is stored for easy access. You can also configure connectors locally.
 
-No problem, let's add it to our GraphQL schema by appending the following mutation to the `schema.graphqls` file we created above
-```graphql title=schema.graphqls
-type Mutation {
-  AddReading(metric: SensorReadingInput!): CreatedReading
-}
-
-input SensorReadingInput {
-  sensorid: Int!
-  temperature: Float!
-  humidity: Float!
-}
-
-type CreatedReading {
-  event_time: String!
-  sensorid: Int!
-}
+```sql
+SensorAssignment := DISTINCT SensorAssignment ON sensorid ORDER BY updated DESC;
+MachineMinuteTemp := SELECT machineid, endOfMinute(event_time) as timeMin,
+                            avg(temperature) as avgTemp, max(temperature) as maxTemp
+                     FROM SensorReading r
+                     TEMPORAL JOIN SensorAssignment a ON r.sensorid = a.sensorid
+                     GROUP BY machineid, timeMin;
+```
 
+The first statement deduplicates the update stream so that we get the most current assignment for each sensor. That table is then temporally joined to the `SensorReading` stream to identify the machine a sensor was assigned to at the time of the reading. The temporal join ensures that the join is consistent in time. We then aggregate the data for each machine by minute time window.
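+
+Since every table in the script maps to a query endpoint in the generated API, the per-machine aggregates can be queried just like the sensor tables. A sketch, assuming the default generated endpoint name matches the table name:
+
+```graphql
+{
+  MachineMinuteTemp(machineid: 1) {
+    timeMin
+    avgTemp
+    maxTemp
+  }
+}
+```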
 
-To use the data created by this mutation in our SQRL script, we have to import it. Replace the first three lines of the `metrics.sqrl` script with:
-
-```sql title=metrics.sqrl
-IMPORT schema.AddReading AS SensorReading;
-IMPORT time.endOfSecond;
-SecReading := SELECT sensorid, endOfSecond(event_time) as timeSec,
-                     avg(temperature) as temp
-              FROM SensorReading GROUP BY sensorid, timeSec;
-```
-
-We are now using data ingested through the API mutation endpoint instead of the static example data. And for the timestamp on the metrics, we are using the special column `event_time` which captures the time data was ingested through the API.
+## Customize API {#api}
 
-Terminate the running service, run the compiler again, and re-launch the pipeline.
-```bash
-(cd build/deploy; docker compose down -v)
-```
+By default, DataSQRL generates a GraphQL schema for us based on the tables we define in the SQRL script. That's great for rapid prototyping, but eventually we want to customize the API or limit data access.
+
+To save us time, we are going to start with the GraphQL API that DataSQRL generates for us by running this command:
 
 ```bash
-sqrl compile metrics.sqrl schema.graphqls
+sqrl compile metrics.sqrl --api graphql
 ```
 
 ```bash
-docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl schema.graphqls
+docker run --rm -v $PWD:/build datasqrl/cmd compile metrics.sqrl --api graphql
 ```
 
-In GraphiQL, run the following mutation to add a temperature reading:
-```graphql
-mutation {
-  AddReading(metric: {
-    sensorid: 1,
-    temperature: 37.2,
-    humidity: 88
-  }) {
-    sensorid
-    event_time
-  }
-}
-```
+There is now a file called `schema.graphqls` in the same folder as our script. Open it and take a look.
 
-Hit the run button a few times and change the temperature and/or sensor id to insert multiple readings.
+Notice how each table defined in our SQRL script maps to a query endpoint in the API and an associated result type. The query endpoints accept arguments for each column of the table to filter the results by column values.
 
-To query the maximum temperatures, run the following query:
+Let's change the `schema.graphqls` file to fit our API requirements. For example, we are going to remove all but the `sensorid` argument from the `SensorReading` query and make that argument required:
 
-```graphql
-{
-  SensorMaxTemp {
-    sensorid
-    maxTemp
-  }
-}
-```
+```graphql title=schema.graphqls
+type Query {
+    SensorReading(sensorid: Int!): [SensorReading!]
+    ...
+}
+```
 
-## Realtime Updates with Subscriptions {#subscription}
+When we supply the modified GraphQL schema as an argument to DataSQRL, it will compile a data pipeline that exposes exactly that API:
 
-DataSQRL supports GraphQL subscription, so we can push processed data to the user in realtime instead of the user having to query for it. This is useful when we want to update dashboards with new metrics automatically and in realtime.
+```bash
+sqrl run metrics.sqrl schema.graphqls
+```
 
-Let's add an alert when the temperature of a sensor exceeds 50°. First, we add the `HighTempAlert` table to our script:
-```sql title=metrics.sqrl
-HighTempAlert := SELECT * FROM SecReading WHERE temp > 50;
-```
+```bash
+docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl schema.graphqls;
+(cd build/deploy; docker compose up --build -d)
+```
 
-Open the `schema.graphqls` file and add the following subscription and type:
 
-```graphql title=schema.graphqls
-type Subscription {
-  HighTempAlert(sensorid: Int): HighTempAlert
-}
 
-type HighTempAlert {
-  sensorid: Int!
-  timeSec: String!
-  temp: Float! 
-} -``` - -Terminate and rerun the pipeline: -```bash -(cd build/deploy; docker compose down -v) -``` ```bash -sqrl compile metrics.sqrl schema.graphqls +sqrl run metrics.sqrl schema.graphqls ``` ```bash -docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl schema.graphqls +docker run -it -v $PWD:/build datasqrl/cmd compile metrics.sqrl schema.graphqls; +(cd build/deploy; docker compose up --build -d) ``` -```bash -(cd build/deploy; docker compose up --build -d) -``` - -This allows users of our API to subscribe to the `HighTempAlert` table with an optional `sensorid` argument to only receive alerts for a particular sensor. Whenever a sensor reading exceeds 50ยฐ, the user will be immediately notified. +When you refresh GraphiQL in the browser, you see the updated API. -Open two browser windows and navigate to [`http://localhost:8888/graphiql/`](http://localhost:8888//graphiql/) so you can see them both. - -On one, start the graphql subscription: -```graphql -subscription { - HighTempAlert(sensorid: 2) { - sensorid - temp - timeSec - } -} -``` - -On the other, fire off a mutation: -```graphql -mutation { - AddReading(metric: { - sensorid: 2, - temperature: 90.5, - humidity: 88 - }) { - sensorid - event_time - } -} -``` - -Wait a second and fire off a second one: -```graphql -mutation { - AddReading(metric: { - sensorid: 2, - temperature: 95.5, - humidity: 88 - }) { - sensorid - event_time - } -} -``` - -Voila, we just built a fully-functioning monitoring service that ingests, aggregates, and serves metrics data in realtime with push-based alerts. And the best part? The DataSQRL compiler ensures that it is efficient, fast, robust, and scalable. - \ No newline at end of file +Voila, we just built a fully-functioning monitoring service that ingests, aggregates, and serves metrics data in realtime with push-based notifications. And the best part? The DataSQRL compiler ensures that it is efficient, fast, consistent, robust, and scalable. diff --git a/docs/getting-started/tutorials/overview.md b/docs/getting-started/tutorials/overview.md index 00cb474e..af9cc1e2 100644 --- a/docs/getting-started/tutorials/overview.md +++ b/docs/getting-started/tutorials/overview.md @@ -1,16 +1,15 @@ -[//]: # (# DataSQRL Tutorials) +# DataSQRL Tutorials -[//]: # () -[//]: # (The tutorials implement a use case with DataSQRL. We currently cover the following use cases:) -[//]: # () -[//]: # (* [**Customer 360**](../customer360/intro): Build a customer 360 by integrating customer data.) +The tutorials implement a use case with DataSQRL and provide step-by-step instructions: -[//]: # (* [**Recommendations**](../recommendations/intro): Build a content recommendation engine based on click-stream data.) +* [**Metrics Processing**](../../quickstart): Ingest and aggregate metrics data. +* [**Recommendations**](../recommendations/intro): Build a content recommendation engine based on click-stream data. -[//]: # (* [**Internet of Things**](../iot/intro): Aggregate and analyze sensor data from a factory floor.) +## DataSQRL Examples -[//]: # () -[//]: # (We'll be adding more tutorials with time, so check back soon. If you'd like to see another use case covered, please) +The [DataSQRL Examples](https://github.com/DataSQRL/datasqrl-examples/) GitHub repository contains a number of DataSQRL projects for various use cases. +You can clone the repository to play with the examples and adjust them to your needs: -[//]: # ([let us know](/community).) 
\ No newline at end of file
diff --git a/docs/getting-started/tutorials/recommendations/intro.md b/docs/getting-started/tutorials/recommendations/intro.md
index b6bdc8e7..2868793f 100644
--- a/docs/getting-started/tutorials/recommendations/intro.md
+++ b/docs/getting-started/tutorials/recommendations/intro.md
@@ -1,197 +1,197 @@
-[//]: # (---)
-[//]: # (title: "Recommendations")
-[//]: # (---)
+---
+title: "Recommendations"
+---
 
-[//]: # (# Content Recommendation through Clickstream Analysis)
-[//]: # ()
-[//]: # (Content Recommendation >)
+# Content Recommendation through Clickstream Analysis
+
+Content Recommendation >
 
-[//]: # (We are going to build a content recommendation engine for a fictitious literature website. The site has amazing content like "The Heart is Deceitful Above All Things" - fake data generators sometimes come up with profound insights.)
+We are going to build a content recommendation engine for a fictitious literature website. The site has amazing content like "The Heart is Deceitful Above All Things" - fake data generators sometimes come up with profound insights.
 
-[//]: # (Our recommendation engine uses the website logs to extract the clickstream of the pages that a user visits. We recommend other content based on what pages users visit next. And, of course, we want all of this to work in real-time so our users don't miss out on the latest trends in literature.)
+Our recommendation engine uses the website logs to extract the clickstream of the pages that a user visits. We recommend other content based on what pages users visit next. And, of course, we want all of this to work in real-time so our users don't miss out on the latest trends in literature.
 
-[//]: # (## Run SQRL Script {#run})
+## Run SQRL Script {#run}
 
-[//]: # (In the terminal or command line, create an empty folder for the SQRL script:)
+In the terminal or command line, create an empty folder for the SQRL script:
 
-[//]: # (```bash)
-[//]: # (> mkdir clickstream; cd clickstream)
-[//]: # (```)
+```bash
+> mkdir clickstream; cd clickstream
+```
 
-[//]: # (Create a new file in that folder called `clickstream.sqrl` and paste the following content into the file (we'll explain it line-by-line [below](#sqrl)):)
+Create a new file in that folder called `clickstream.sqrl` and paste the following content into the file (we'll explain it line-by-line [below](#sqrl)):
 
-[//]: # (```sql)
-[//]: # (IMPORT datasqrl.tutorials.clickstream.Click; -- Import data)
-[//]: # (/* Most visited pages in the last day */)
-[//]: # (Trending := SELECT url, count(1) AS total)
-[//]: # (    FROM Click WHERE timestamp > now() - INTERVAL 1 DAY)
-[//]: # (    GROUP BY url ORDER BY total DESC;)
-[//]: # (/* Find next page visits within 10 minutes */)
-[//]: # (VisitAfter := SELECT b.url AS beforeURL, a.url AS afterURL,)
-[//]: # (    a.timestamp AS timestamp)
-[//]: # (    FROM Click b JOIN Click a ON b.userid=a.userid AND)
-[//]: # (        b.timestamp <= a.timestamp AND)
-[//]: # (        b.timestamp >= a.timestamp - INTERVAL 10 MINUTE;)
-[//]: # (/* Recommend pages that are visited shortly after */)
-[//]: # (Recommendation := SELECT beforeURL AS url, afterURL AS rec,)
-[//]: # (    count(1) AS frequency FROM VisitAfter)
-[//]: # (    GROUP BY url, rec ORDER BY url ASC, frequency DESC;)
-[//]: # (```)
+```sql
+IMPORT datasqrl.tutorials.clickstream.Click; -- Import data
+/* Most visited pages in the last day */
+Trending := SELECT url, count(1) AS total
+            FROM Click WHERE timestamp > now() - INTERVAL 1 DAY
+            GROUP BY url ORDER BY total DESC;
+/* Find next page visits within 10 minutes */
+VisitAfter := SELECT b.url AS beforeURL, a.url AS afterURL,
+                     a.timestamp AS timestamp
+              FROM Click b JOIN Click a ON b.userid=a.userid AND
+                   b.timestamp <= a.timestamp AND
+                   b.timestamp >= a.timestamp - INTERVAL 10 MINUTE;
+/* Recommend pages that are visited shortly after */
+Recommendation := SELECT beforeURL AS url, afterURL AS rec,
+                         count(1) AS frequency FROM VisitAfter
+                  GROUP BY url, rec ORDER BY url ASC, frequency DESC;
+```
 
-[//]: # (Now run the DataSQRL compiler to build a recommendation engine from the data transformations and aggregations defined in the script:)
+Now run the DataSQRL compiler to build a recommendation engine from the data transformations and aggregations defined in the script:
 
-[//]: # (```bash)
-[//]: # (docker run --rm -v $PWD:/build datasqrl/cmd compile metrics.sqrl)
-[//]: # (```)
+```bash
+docker run --rm -v $PWD:/build datasqrl/cmd compile clickstream.sqrl
+```
 
-[//]: # (To run the recommendation engine, execute:)
+To run the recommendation engine, execute:
 
-[//]: # (```bash)
-[//]: # ((cd build/deploy; docker compose up))
-[//]: # (```)
+```bash
+(cd build/deploy; docker compose up)
+```
 
-[//]: # (## Query Data API {#query})
+## Query Data API {#query}
 
-[//]: # (The running data pipeline compiled by DataSQRL exposes a GraphQL data API which you can access by opening [`http://localhost:8888/graphiql/`](http://localhost:8888/graphiql/) in your browser. Write GraphQL queries in the left-hand panel. For example, copy the following query:)
+The running data pipeline compiled by DataSQRL exposes a GraphQL data API which you can access by opening [`http://localhost:8888/graphiql/`](http://localhost:8888/graphiql/) in your browser. Write GraphQL queries in the left-hand panel. For example, copy the following query:
 
-[//]: # (```graphql)
-[//]: # ({)
-[//]: # (  Recommendation(url: "mascot_books/a_time_of_gifts") {)
-[//]: # (    rec)
-[//]: # (    frequency)
-[//]: # (  })
-[//]: # (})
-[//]: # (```)
+```graphql
+{
+  Recommendation(url: "mascot_books/a_time_of_gifts") {
+    rec
+    frequency
+  }
+}
+```
 
-[//]: # (When you hit the "run" button you get the recommendations for the given page URL ordered by the frequency of correlated visit.)
-[//]: # (You now have a working recommendation engine you can integrate into your application.)
+When you hit the "run" button you get the recommendations for the given page URL, ordered by the frequency of correlated visits.
+You now have a working recommendation engine you can integrate into your application.
 
-[//]: # (## Description of SQRL Script {#sqrl})
+## Description of SQRL Script {#sqrl}
 
-[//]: # (Let's have a closer look at the SQRL script for our content recommendation engine and dissect what it does.)
+Let's have a closer look at the SQRL script for our content recommendation engine and dissect what it does.
 
-[//]: # (:::info)
-[//]: # (SQRL is an extension of SQL, and we are going to use some basic SQL syntax. If you are unfamiliar with SQL, we recommend you read our [SQL Primer](/docs/reference/sqrl/sql-primer) first.)
-[//]: # (:::)
+:::info
+SQRL is an extension of SQL, and we are going to use some basic SQL syntax. If you are unfamiliar with SQL, we recommend you read our [SQL Primer](/docs/reference/sqrl/sql-primer) first.
+:::
 
-[//]: # (```sql)
-[//]: # (IMPORT datasqrl.example.clickstream.Click;)
-[//]: # (```)
+```sql
+IMPORT datasqrl.tutorials.clickstream.Click;
+```
 
-[//]: # (This import statement imports the `Click` table from the package [datasqrl.example.clickstream](https://dev.datasqrl.com/package/datasqrl.example.clickstream/). The `Click` table contains records of user clicks on the content URLs, including the user ID, URL, and the timestamp of the visit.)
+This import statement imports the `Click` table from the package [datasqrl.tutorials.clickstream](https://dev.datasqrl.com/package/datasqrl.tutorials.clickstream/). The `Click` table contains records of user clicks on the content URLs, including the user ID, URL, and the timestamp of the visit.
 
-[//]: # (```sql)
-[//]: # (Trending := SELECT url, count(1) AS total)
-[//]: # (    FROM Click WHERE timestamp > now() - INTERVAL 1 DAY)
-[//]: # (    GROUP BY url ORDER BY total DESC;)
-[//]: # (```)
+```sql
+Trending := SELECT url, count(1) AS total
+            FROM Click WHERE timestamp > now() - INTERVAL 1 DAY
+            GROUP BY url ORDER BY total DESC;
+```
 
-[//]: # (This statement defines the `Trending` table, which shows the most visited pages in the last day. It selects the url and the count of clicks on each URL (as total) from the `Click` table, filtering records with a timestamp within the last day. The resulting table is grouped by the url and ordered by the total number of clicks in descending order.)
+This statement defines the `Trending` table, which shows the most visited pages in the last day. It selects the url and the count of clicks on each URL (as total) from the `Click` table, filtering records with a timestamp within the last day. The resulting table is grouped by the url and ordered by the total number of clicks in descending order.
 
-[//]: # (```sql)
-[//]: # (VisitAfter := SELECT b.url AS beforeURL, a.url AS afterURL,)
-[//]: # (    a.timestamp AS timestamp)
-[//]: # (    FROM Click b JOIN Click a ON b.userid=a.userid AND)
-[//]: # (        b.timestamp <= a.timestamp AND)
-[//]: # (        b.timestamp >= a.timestamp - INTERVAL 10 MINUTE;)
-[//]: # (```)
+```sql
+VisitAfter := SELECT b.url AS beforeURL, a.url AS afterURL,
+                     a.timestamp AS timestamp
+              FROM Click b JOIN Click a ON b.userid=a.userid AND
+                   b.timestamp <= a.timestamp AND
+                   b.timestamp >= a.timestamp - INTERVAL 10 MINUTE;
+```
 
-[//]: # (The `VisitAfter` table identifies pairs of URLs that were visited by the same user within a 10-minute interval. It selects the url from the `Click` table (aliased as b) as beforeURL, the url from the `Click` table (aliased as a) as afterURL, and the timestamp of the afterURL click as timestamp. The JOIN condition ensures that the click records have the same userid, and the timestamp of the beforeURL click is within 10 minutes of the afterURL click.)
+The `VisitAfter` table identifies pairs of URLs that were visited by the same user within a 10-minute interval. It selects the url from the `Click` table (aliased as b) as beforeURL, the url from the `Click` table (aliased as a) as afterURL, and the timestamp of the afterURL click as timestamp. The JOIN condition ensures that the click records have the same userid, and that the timestamp of the beforeURL click is within 10 minutes of the afterURL click.
 
-[//]: # (```sql)
-[//]: # (Recommendation := SELECT beforeURL AS url, afterURL AS rec,)
-[//]: # (    count(1) AS frequency FROM VisitAfter)
-[//]: # (    GROUP BY url, rec ORDER BY url ASC, frequency DESC;)
-[//]: # (```)
+```sql
+Recommendation := SELECT beforeURL AS url, afterURL AS rec,
+                         count(1) AS frequency FROM VisitAfter
+                  GROUP BY url, rec ORDER BY url ASC, frequency DESC;
+```
 
-[//]: # (The `Recommendation` table generates recommendations for pages that are frequently visited shortly after visiting another page. It selects the beforeURL as url, the afterURL as rec, and the count of occurrences of each pair as frequency from the `VisitAfter` table. The resulting table is grouped by url and rec, and ordered by url in ascending order and frequency in descending order. This provides a list of recommended pages for each URL based on the frequency of co-visits within a 10-minute interval.)
+The `Recommendation` table generates recommendations for pages that are frequently visited shortly after visiting another page. It selects the beforeURL as url, the afterURL as rec, and the count of occurrences of each pair as frequency from the `VisitAfter` table. The resulting table is grouped by url and rec, and ordered by url in ascending order and frequency in descending order. This provides a list of recommended pages for each URL based on the frequency of co-visits within a 10-minute interval.
 
-[//]: # (And that's all you need to build a basic recommendation engine that recommends trending pages and related pages based on co-visits by users.)
+And that's all you need to build a basic recommendation engine that recommends trending pages and related pages based on co-visits by users.
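+
+One illustrative extension (hypothetical, not part of this tutorial): as defined above, `VisitAfter` pairs a click with itself, so `Recommendation` can recommend a page as a follow-up to the very same page; a single `WHERE` clause on the existing definition filters those self-recommendations out:
+
+```sql
+/* Hypothetical variant of the Recommendation table that drops
+   self-recommendations (beforeURL = afterURL). */
+Recommendation := SELECT beforeURL AS url, afterURL AS rec,
+                         count(1) AS frequency FROM VisitAfter
+                  WHERE beforeURL <> afterURL
+                  GROUP BY url, rec ORDER BY url ASC, frequency DESC;
+```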
+ + +## Next Steps {#next} + + +Read the [DataSQRL introduction](../../../intro/overview) which is an in-depth tutorial of DataSQRL and gives you all the information you need to extend this recommendation engine and suit it to your needs. diff --git a/docs/intro.md b/docs/intro.md index 0a26eecc..3ff465aa 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -5,123 +5,36 @@ import TabItem from '@theme/TabItem'; DataSQRL Documentation > -DataSQRL is an open-source data development framework for building data pipelines, event-driven microservices, and AI data products. It provides the basic structure, common patterns, and a set of tools for streamlining the development of data solutions. +DataSQRL is an open-source data development framework for building data pipelines, event-driven microservices, and AI data products. It provides the basic structure, common patterns, and a set of tools for streamlining the development of data products. DataSQRL automates data plumbing and integrates data technologies like Apache Kafka, Flink, Postgres, DuckDB, GraphQL, and more. -DataSQRL is an open-source framework that you can use to build and deploy data pipelines. Get started with: +DataSQRL can ingest data directly or from an external source, process the data with SQL, Java, & Python, and expose it through an API, database view, or stream to consumers. - - - - -```bash -brew tap datasqrl/sqrl -brew install sqrl-cli -``` - -:::note -Check that you're on the current version of DataSQRL by running `sqrl --version` -To update an existing installation: - -```bash -brew upgrade sqrl-cli -``` -::: - - - -Always pull the latest Docker image to ensure you have the most recent updates: - -```bash -docker pull datasqrl/cmd:latest -``` - - - - -With DataSQRL, there's no need for complex backend setups or managing multiple services. Just add DataSQRL in your development process, define your data processing in SQL, and let DataSQRL compile everything into an optimized, scalable data pipeline. DataSQRL handles everything from data ingestion from various sources like databases, streams, and APIs to processing and exposing the data via APIs or directly to databases. 
- -Here are a few example pipelines: - - - - -```sql -IMPORT datasqrl.tutorials.seedshop.Orders; -- Import orders stream -IMPORT time.endOfWeek; -- Import time function - -/* Create new table of unique customers */ -Users := SELECT DISTINCT customerid AS id FROM Orders; - -/* Create relationship between customers and orders */ -Users.purchases := JOIN Orders ON Orders.customerid = @.id; - -/* Aggregate the purchase history for each user by week */ -Users.spending := SELECT endOfWeek(p.time) AS week, - sum(i.quantity * i.unit_price) AS spend - FROM @.purchases p JOIN p.items i - GROUP BY week ORDER BY week DESC; -``` - - - -```sql -IMPORT datasqrl.tutorials.clickstream.Click; -- Import data - -/* Find next page visits within 10 minutes */ -VisitAfter := SELECT b.url AS beforeURL, a.url AS afterURL, - a.timestamp AS timestamp - FROM Click b JOIN Click a ON b.userid=a.userid AND - b.timestamp < a.timestamp AND - b.timestamp >= a.timestamp - INTERVAL 10 MINUTE; - -/* Recommend pages that are frequently co-visited */ -Recommendation := SELECT beforeURL AS url, afterURL AS rec, - count(1) AS frequency FROM VisitAfter - GROUP BY url, rec ORDER BY url ASC, frequency DESC; -``` - - - -```sql -IMPORT datasqrl.tutorials.sensors.*; -- Import sensor data -IMPORT time.endOfSecond; -- Import time function - -/* Aggregate sensor readings to second */ -SecReading := SELECT sensorid, endOfSecond(time) as timeSec, - avg(temperature) as temp FROM SensorReading - GROUP BY sensorid, timeSec; - -/* Get max temperature in last minute */ -SensorMaxTemp := SELECT sensorid, max(temp) as maxTemp - FROM SecReading - WHERE timeSec >= now() - INTERVAL 1 MINUTE - GROUP BY sensorid; -``` - - +* [Get Started](getting-started/quickstart) building with DataSQRL. +* Follow the [DataSQRL Tutorials](getting-started/tutorials/overview) for step-by-step instructions and see what you can build with DataSQRL. ## Why use DataSQRL? + - Simplified Data Engineering: Automate the integration of multiple data technologies and eliminate the "data plumbing" needed to build data pipelines. - Runs Everywhere: DataSQRL runs in cloud environments, local servers, and even on large-scale data infrastructure. - User-Friendly SQL Interface: Focus on what your data should do, not how it's processed. DataSQRL's SQL-based interface is accessible to beginners and powerful for experts. -You can also learn more about [DataSQRL](../getting-started/concepts/datasqrl), [its benefits](../getting-started/concepts/why-datasqrl), and [whether to use DataSQRL](../getting-started/concepts/when-datasqrl) for your data product. +You can also learn more about: +* [What DataSQRL](../getting-started/concepts/datasqrl) is, +* [Why DataSQRL](../getting-started/concepts/why-datasqrl) exists, and +* [Types of Data Architectures](../architectures/intro) DataSQRL supports. -## Getting Started with DataSQRL +## How to use DataSQRL? -- Jump into our [Getting Started Guide](../getting-started) for a quick introduction to the basics of DataSQRL. -- Explore the Interactive Tutorial and [Quickstart](../getting-started/quickstart) to see DataSQRL in action. -- Check out the [reference documentation](../reference/sqrl/datasqrl-spec) covers all aspects of DataSQRL in detail. If you want more information on how to use DataSQRL or are looking for comprehensive documentation, this is your place to go. -- Learn more about advanced features and customization in our How-to Guides. +* Follow one of the [DataSQRL Tutorials](../getting-started/tutorials/overview). 
+* Learn about the core [DataSQRL concepts](../reference/sqrl/learn) and how it all fits together. +* Check out the [Reference Documentation](../reference/introduction) for all the details. ## Join the DataSQRL community DataSQRL is an open-source project. If you want to learn more about the internals or contribute to the project: - Star us on [GitHub](https://github.com/DataSQRL/sqrl) and contribute to the project. -- Discuss, ask questions, and share your DataSQRL projects on Slack. +- Discuss, ask questions, and share your DataSQRL projects on [Slack](/community). - Report issues or suggest features directly on our [GitHub Issues](https://github.com/DataSQRL/sqrl/issues) page. -## Getting Information -Technology is only one piece to a successful data product. Learn more about the [DataSQRL process](../process/intro) for building [data products](../reference/concepts/data-product) that deliver value and keep your team sane. - ## Getting Help + Couldn't find what you were looking for? Need help or want to talk to somebody about your problem? No worries, the [DataSQRL community](/community) is here to help. \ No newline at end of file diff --git a/docs/process/intro.md b/docs/process/intro.md index 8febb172..2c5874d3 100644 --- a/docs/process/intro.md +++ b/docs/process/intro.md @@ -1,4 +1,4 @@ -# The DataSQRL Process +# The DataSQRL Framework Across many organizations, we observed that data products commonly struggle or fail because of: @@ -8,15 +8,15 @@ Across many organizations, we observed that data products commonly struggle or f * **Misaligned Process**: The process used to implement the data product is not focused on the needs of the customer. * **Political Interference**: Internal and external politics and governance requirements add significant friction or outright derail the project. -[DataSQRL](/docs/getting-started/concepts/datasqrl) simplifies the technology for building data products. To address the other two problem areas that frequently strike data product implementations, we developed the **DataSQRL Process**. +[DataSQRL](/docs/getting-started/concepts/datasqrl) not only simplifies the technology for building data products; to address the other two problem areas that frequently strike data product implementations, we also developed the **DataSQRL Framework**. -The basic idea behind the DataSQRL process is to focus on value delivery and limit the impact of external factors. In the context of building data products, these goals can be surprisingly difficult to attain because a team's energy and attention is frequently drawn to implementation, orchestration, and planning issues. And managing the political ramifications of data projects can be an outright Kafkaesque experience. +The basic idea behind the DataSQRL framework is to focus on value delivery and limit the impact of external factors. In the context of building data products, these goals can be surprisingly difficult to attain because a team's energy and attention are frequently drawn to implementation, orchestration, and planning issues. And managing the political ramifications of data projects can be an outright Kafkaesque experience. ## Key Principles -The DataSQRL process is based on three key principles: +The DataSQRL framework is based on three key principles: -DataSQRL Process Key Principles > +DataSQRL Framework Key Principles > 1. [Customer-focused](../customer-focused): Focus on customer satisfaction through early and continuous delivery of valuable data products. 2.
[Responsive](../responsive): Harness changing requirements and creative input from all stakeholders for competitive advantage. @@ -26,20 +26,20 @@ We distilled these principles from our work on a broad range of data product imp Maybe it's just us, but we find having a bit of light in the fog of a data product implementation to be very reassuring. -## Adopt the DataSQRL Process to Your Organization +## Adapt the DataSQRL Framework to Your Organization -The DataSQRL process is a framework and not a prescriptive process implementation. In line with key principle #3, we found that processes that are closely aligned with an organization's existing software development processes are the most likely to be successful. +The DataSQRL framework is not a prescriptive process implementation. In line with key principle #3, we found that processes that are closely aligned with an organization's existing software development processes are the most likely to be successful. We recommend developing a data product implementation process by applying the key principles to your existing software development process. The goal is to strike a balance between accommodating the unique characteristics of data product implementations and aligning with your existing development workflows. That does sound a bit wishy-washy. Unfortunately, there is some art in it. We will try to nail it down more as it matures. Until then, [we can help you with that](/services). -## Why Do We Need a New Process? +## Why Do We Need a New Framework? The unique requirements of data products - like data acquisition, model building, or data architecture design - often do not fit into an organization's existing software development process, which leads to the adoption of specific data science or data engineering processes. While those processes are well-suited to address these unique requirements, they often lose focus of customer needs and value delivery because of their complexity, lengthy planning cycles, and high implementation cost. In addition, the misalignment with an organization's development process results in high friction and operational overhead which can further delay value delivery. You win the battle but lose the war - and usually a lot of money. -The goal of the DataSQRL process is to shift the focus back to the needs of customers and value generation, while accommodating the unique requirements of data products and staying aligned with the existing development process. In other words, removing the distractions from building with data, so we can all have some fun again. +The goal of the DataSQRL framework is to shift the focus back to the needs of customers and value generation, while accommodating the unique requirements of data products and staying aligned with the existing development process. In other words, removing the distractions from building with data, so we can all have some fun again. -The DataSQRL process is a work in progress, and we continuously refine it as we learn from our community and customers. Please share your thoughts, opinions, and ideas by [joining the DataSQRL community](/community) or [working with us](/services). \ No newline at end of file +The DataSQRL framework is a work in progress, and we continuously refine it as we learn from our community and customers. Please share your thoughts, opinions, and ideas by [joining the DataSQRL community](/community) or [working with us](/services).
\ No newline at end of file diff --git a/docs/reference/concepts/data-product.md b/docs/reference/concepts/data-product.md index 9b17a44f..783bf53d 100644 --- a/docs/reference/concepts/data-product.md +++ b/docs/reference/concepts/data-product.md @@ -1,13 +1,28 @@ --- -title: "Data Product" +title: "What's a Data Product?" --- # What is a Data Product? -A data product is a piece of software that processes data to deliver actionable, valuable insights or results. +A data product is a piece of software that processes data to deliver actionable, valuable insights or results. A data product produces tangible business value in a consistent, reliable, and sustainable manner. + Data products take raw data as input, apply a series of transformations, algorithmic processes, or analytics, and produce information that is useful to customers, decision makers, or business operations. Data products can range in complexity from simple analytics dashboards to personalized recommendation engines utilizing machine learning models and generative AI. +## Data Product Characteristics + +Data products encompass a broad range of software that produces business value from data. +What makes such data-driven software a "product" is a match between customer demand and profitable delivery. +More specifically, what distinguishes a data product from a data project are the following characteristics: + +* **Customer Demand**: The data product solves a real and tangible pain for the customer. +* **Easy Access**: Customers can discover and access the data product within their regular workflows. +* **Easy Consumption**: Customers can use the data product to derive value. +* **Consistent Quality**: The data product is delivered with consistent, well-defined quality. +* **Reliable SLA**: The data product is delivered with reliable SLAs for latency, uptime, throughput, etc. +* **Sustainable**: The data product can be maintained and evolved over time. +* **Profitable**: The data product generates a positive return on investment. + ## What does a Data Product Consist Of? {#components} A data product implementation consists of multiple stages to go from raw data to valuable result. diff --git a/docs/reference/sqrl/learn.md b/docs/reference/sqrl/learn.md index bc7f5a30..b9ba86d9 100644 --- a/docs/reference/sqrl/learn.md +++ b/docs/reference/sqrl/learn.md @@ -1,5 +1,12 @@ -# DataSQRL +# How to Use DataSQRL + +To use DataSQRL, you follow these steps: + +* **Implement SQRL script** to ingest, transform, analyze, and expose data (see the sketch below). +* **Execute DataSQRL command** to compile, run, test, and deploy the implemented data architecture/pipeline. +* **Configure connectors** to ingest data from and write data to external data systems. +* **Configure DataSQRL** to customize the data architecture that DataSQRL compiles (e.g., the GraphQL API and pipeline topology). ## Introduction DataSQRL enhances the SQRL language to optimize data pipeline construction for developers. It features an advanced optimizer that efficiently directs queries across multiple integrated engines. This flexible, composable architecture allows for customization and scalability by enabling the integration of only the necessary engines. DataSQRL also offers comprehensive developer tools, including schema management, generating deployment assets, a repository for dependency management, efficient dependency handling, and more. These features make DataSQRL a versatile and powerful tool for developing and managing data pipelines.
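To make the steps above concrete, here is a minimal sketch of a SQRL script, modeled on the seedshop example above. The `datasqrl.tutorials.seedshop.Orders` import and the `time.endOfWeek` function are assumed to be available as in the tutorials; the exact commands for compiling and running the script are covered in the quickstart.

```sql
IMPORT datasqrl.tutorials.seedshop.Orders;  -- ingest: the orders event stream
IMPORT time.endOfWeek;                      -- time function for weekly bucketing

/* Transform: derive a table of unique customers from the order stream */
Users := SELECT DISTINCT customerid AS id FROM Orders;

/* Relate each user to their orders */
Users.purchases := JOIN Orders ON Orders.customerid = @.id;

/* Analyze: aggregate each user's spending by week */
Users.spending := SELECT endOfWeek(p.time) AS week,
                         sum(i.quantity * i.unit_price) AS spend
                  FROM @.purchases p JOIN p.items i
                  GROUP BY week ORDER BY week DESC;
```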
@@ -8,6 +15,49 @@ DataSQRL is an open-source project, which means you can view the [entire source Developer Documentation > +## Installation + +To compile and run a SQRL script, you need the DataSQRL compiler. You can either install the DataSQRL command natively on your machine, or use Docker on any machine. + +In addition to the DataSQRL compiler, the command version also includes a local development and testing runtime that significantly speeds up the development cycle and allows you to run automated tests. The Docker version only includes the compiler and builds the compiled data architecture with Docker Compose, which simulates a production environment but can take minutes to start up. + + + + +```bash +brew tap datasqrl/sqrl +brew install sqrl-cli +``` + +:::note +Check that you're on the current version of DataSQRL by running `sqrl --version`. +To update an existing installation: + +```bash +brew upgrade sqrl-cli +``` +::: + + + + +Pull the latest Docker image to ensure you have the most recent version of DataSQRL: + +```bash +docker pull datasqrl/cmd:latest +``` + + + + +## SQRL + +:::info + +If you are unfamiliar with SQL, we recommend you read our [SQL Primer](/docs/reference/sqrl/sql-primer) first. + +::: + ### Mission and Goals behind DataSQRL The fundamental mission of DataSQRL is to democratize the process of building efficient, scalable data products by making advanced data pipeline tools accessible and easy to use. Our goal is to empower developers by simplifying the data pipeline construction process, reducing the barrier to entry, and accelerating the path from development to production. diff --git a/sidebars.js b/sidebars.js index 7b2b8213..c03c9e44 100644 --- a/sidebars.js +++ b/sidebars.js @@ -25,14 +25,14 @@ const sidebars = { { type: 'doc', label: '🚀 Getting Started', - id: 'getting-started/getting-started', + id: 'getting-started/quickstart', }, { type: 'category', label: '📔 Tutorials', link: { type: 'doc', - id: 'getting-started/quickstart', + id: 'getting-started/tutorials/overview', }, items: [ { @@ -70,16 +70,16 @@ const sidebars = { // 'getting-started/concepts/sqrl', "getting-started/concepts/why-datasqrl", 'getting-started/concepts/when-datasqrl', - 'reference/concepts/data-product', { type: 'category', - label: 'DataSQRL Process', + label: 'DataSQRL Framework', link: { type: 'doc', id: 'process/intro', }, collapsed: true, items: [ + 'reference/concepts/data-product', 'process/customer-focused', 'process/responsive', 'process/integrated' ] }, }, { type: 'doc', - label: '💡 Learn', + label: '💡 How to Use DataSQRL', id: 'reference/sqrl/learn' }, { type: 'category', - label: '🔪 DataSQRL Reference', + label: '📖 DataSQRL Reference', collapsed: false, link: { type: 'doc', From 8ae1926cc5e19a489f5f70ef5ace7998059d4bfa Mon Sep 17 00:00:00 2001 From: Matthias Broecheler Date: Mon, 7 Oct 2024 13:14:03 -0700 Subject: [PATCH 2/2] remove old language --- docs/getting-started/concepts/datasqrl.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/getting-started/concepts/datasqrl.md b/docs/getting-started/concepts/datasqrl.md index 93a0e708..19ca30db 100644 --- a/docs/getting-started/concepts/datasqrl.md +++ b/docs/getting-started/concepts/datasqrl.md @@ -4,8 +4,6 @@ title: "What is DataSQRL?" # What is DataSQRL? -Copy-paste repo README - DataSQRL is a flexible data development framework for building various types of streaming data architectures, like data pipelines, event-driven microservices, and Kappa.
It provides the basic structure, common patterns, and a set of tools for streamlining the development process of [data products](/docs/reference/concepts/data-product). DataSQRL integrates any combination of the following technologies: