Kinetica and NVIDIA Open Door for Fast Analysis of Large Geospatial Datasets
September 19, 2017
We live in a world where data is being generated at an exponential rate. IoT systems alone might track millions of sensors, each reporting statuses at frequent intervals. These datasets typically contain time and location information, which provide useful context when viewed on a map.
But analyzing and visualizing large spatial datasets is a challenge. Today’s spatial databases weren’t designed for data at this scale, and they struggle to ingest and analyze large volumes of streaming, location-aware data.
To do analysis on large datasets with any sort of interactivity, to visualize millions or billions of records, and to then filter and analyze them, you need a system that can overcome two fundamental challenges:
- Most databases are simply not designed to perform large-scale geospatial analytics in a reasonable amount of time.
Imagine trying to analyze millions of customer purchases and aggregate them by ad hoc proximity to retail stores. The polygon intersection calculations involved are expensive. Multiply that by millions or billions of records, and your query may leave you waiting all week. That may not be a problem if you only need to run it once, but it is not acceptable if you need to make repeated ad hoc queries on the data.
- It’s hard to visualize large geospatial datasets through the web with any sort of interactivity.
Browsers struggle to handle more than a few thousand features, and it takes time to send large volumes of data over the wire. Send more than several thousand points or a thousand polygons to a browser and it will slow to a crawl. Eventually there is a threshold where it’s not practical to send the data over the wire for the browser to sort out.
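To get a feel for why the first challenge is expensive, here is a minimal Python sketch of the purchases-near-stores example. It is not how Kinetica evaluates such a query, and it swaps the polygon intersection test for a simpler point-within-radius test based on the haversine formula. The store names, coordinates and amounts are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def total_spend_near_stores(purchases, stores, radius_km):
    """Sum purchase amounts within radius_km of each store (brute force)."""
    totals = {name: 0.0 for name, _, _ in stores}
    for lat, lon, amount in purchases:
        for name, s_lat, s_lon in stores:
            # One distance test per (purchase, store) pair.
            if haversine_km(lat, lon, s_lat, s_lon) <= radius_km:
                totals[name] += amount
    return totals


# Hypothetical sample data: (name, lat, lon) and (lat, lon, amount).
stores = [("downtown", 37.7749, -122.4194), ("airport", 37.6213, -122.3790)]
purchases = [
    (37.7750, -122.4190, 20.0),  # right next to "downtown"
    (37.7800, -122.4100, 35.5),  # roughly 1 km from "downtown"
    (37.6200, -122.3800, 12.0),  # right next to "airport"
    (40.7128, -74.0060, 99.0),   # New York, far from both stores
]
totals = total_spend_near_stores(purchases, stores, radius_km=2.0)
```

The nested loop performs one distance test for every purchase and store pair, so the work grows with the product of the two counts. That is the kind of brute-force computation that becomes impractical on a single CPU thread as record counts climb into the millions or billions.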
The GPU-Accelerated Spatial Database
Parallelized processing on the GPU provides a path to overcoming performance challenges. Kinetica, the fastest GPU-accelerated database, is capable of spreading spatial computations across thousands of GPU nodes, multiple cards and multiple machines. This makes it exceptional for the types of brute-force calculations needed for advanced analytics of large and streaming geospatial-temporal datasets.
Kinetica is also adept at managing high-velocity streaming data — such as you might get from social media feeds, moving vehicles, smart meters and sensors. Each node within a Kinetica cluster can share the load of ingesting data, and since less indexing is required, data is available for query the moment it arrives.
For geospatial datasets, more is required. While any database can store geospatial coordinates as numbers, a separate system is then needed to retrieve these records, convert them into geometry objects and evaluate queries. This is slow and inefficient, and a major bottleneck as datasets grow.
A “spatially aware database” has the geometry engine built in, and has native functions for filtering and working with geospatial data. Kinetica combines native geospatial functionality with GPU acceleration to deliver a database that can work with massive geospatial datasets and compute relationships between shapes and objects — all within a single system.
Kinetica comes with a suite of geospatial functions that run natively within the database. This makes it possible to get fast results on queries such as the following:
The challenge of how to visualize large datasets interactively still remains. If you’re sending more than a few thousand points or polygons over the wire to a mapping client, things are liable to slow to a crawl.
Kinetica solves this with a native geospatial web server capable of leveraging the GPU to quickly render vector-based map visualizations, on the fly. These dynamic map visualizations can be integrated with any OGC-compliant web mapping API to allow for interaction with features.
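The general idea behind server-side rendering can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of Kinetica's renderer, which the article does not detail. The function name, grid size and sample coordinates are assumptions made for the example.

```python
def bin_points(points, bbox, cols, rows):
    """Count (lon, lat) points per cell of a cols x rows grid over bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat). Row 0 is the northern edge,
    matching the top-down layout of an image.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    grid = [[0] * cols for _ in range(rows)]
    for lon, lat in points:
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # outside the requested map extent
        # Clamp so points exactly on the far edge land in the last cell.
        col = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
        row = min(int((max_lat - lat) / (max_lat - min_lat) * rows), rows - 1)
        grid[row][col] += 1
    return grid


# Hypothetical sample: two points near Los Angeles, one in New York, one in Paris.
points = [(-118.24, 34.05), (-118.25, 34.06), (-73.99, 40.73), (2.35, 48.86)]
grid = bin_points(points, bbox=(-125.0, 24.0, -66.0, 50.0), cols=4, rows=2)
```

Whatever the number of input points, the result is always `cols * rows` counts, which can then be colored and sent to the client as a single image. The browser never has to receive or draw the individual features.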
The Kinetica visualization API comes with the tools necessary to interact with, drill into and explore individual points and shapes on those maps. These can be overlaid on top of base-maps from ESRI, Google, Bing, Mapbox, etc.
Let’s explore a sample historic dataset of Twitter events that is approximately 4 billion unique records. The screen capture below illustrates a heat map visualization of all of those tweets, mostly concentrated in North America. All of these points were rendered on the fly, and in less than 500ms. Nothing was cached or rendered ahead of time.
We can explore the data further by providing some text search criteria to filter the results.
Below is a capture of the map results, updated after an NLP text filter is applied for the term “organic.” The change in the dataset is fairly dramatic, and it all happens in under 300ms. The number of results displayed went from 4 billion-plus to approximately 289,000.
Here’s one of the many results in the Los Angeles area. The attributes of the tweet text confirm that the NLP text search filtered the results correctly.
If you have existing apps or services based on ESRI, MapBox or a traditional open source option such as GeoServer, Kinetica maps support open standards by way of OGC-compliant services. This means that integrating Kinetica map visualizations of millions, billions or trillions of spatial features in real time within your GIS ecosystem is as simple as a single URL string.
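As a rough idea of what such a URL string looks like, the sketch below builds a standard OGC WMS 1.1.1 GetMap request in Python. The query parameters come from the WMS specification. The host, endpoint path and layer name are hypothetical placeholders, and any Kinetica-specific endpoint or extra parameters are not covered in this article.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build an OGC WMS 1.1.1 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        "TRANSPARENT": "TRUE",
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, covering the continental United States.
url = wms_getmap_url(
    "https://maps.example.com/wms",
    layer="tweets",
    bbox=(-125.0, 24.0, -66.0, 50.0),
    width=1024,
    height=512,
)
```

Any client that speaks WMS can be pointed at a URL of this shape and will receive a rendered image for the requested extent, which is what lets a server-rendered layer sit alongside existing base maps.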
Try It Out Yourself
For an idea of what’s possible with geospatial data on Kinetica, take a tour of the Kinetica demo with several large datasets for analysis.
Advanced Visualization Options
The visualization layer also includes some more advanced functionality, including color-coded filtering. As an idea of what is possible with Kinetica, here’s a visualization of those tweets sorted by the year they were made.
Advanced Analytics and Machine Learning with Geospatial Data
Geospatial analysis can be further extended through the User-Defined Functions API — an interface that makes it possible for custom code to run from within the database. Through UDFs, almost any type of analysis is possible. Even for highly customized geospatial operations, the dataset does not need to be extracted into a separate system for analysis. Instead, models can be brought to the data, to be run “in-database.”
This opens a world of possibilities — custom code can even call out to machine learning libraries, such as TensorFlow, for advanced geospatial predictions. This might make it possible for deliveries to be flagged when they are unlikely to arrive on time — based on traffic, weather or other indicators. Insurance companies could better analyze drivers who are most likely to be involved in an accident based on driving behavior; or they could calculate risk for assets from weather models.
What Can You Build with This?
The combination of real-time streaming query, native geospatial operators and advanced map-based visualizations opens opportunities for businesses to perform analyses that were previously difficult or impossible.
The United States Postal Service leverages interactive geospatial analytics for tracking packages, personnel and route planning. Its 200,000-plus devices emit locations once every minute, amounting to more than a quarter billion events captured and analyzed daily. USPS’ parallel cluster serves 15,000 daily sessions, providing service managers and analysts with the capability to instantly see what’s happening in their areas of responsibility.
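The quarter-billion figure is easy to sanity-check. The short calculation below uses the device count and reporting interval quoted above, and assumes (this is not stated in the article) that every device reports around the clock.

```python
devices = 200_000                 # "200,000-plus devices"
reports_per_device = 24 * 60      # one location per minute, assuming 24-hour operation
events_per_day = devices * reports_per_device

print(f"{events_per_day:,} events per day")
```

Under that assumption the total comes to 288 million events per day, consistent with "more than a quarter billion."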
Other customers are using Kinetica for mapping infrastructure, logistics, customer research and more.
Learn more by joining NVIDIA and Kinetica at an upcoming webinar, “Advanced Analytics and Machine Learning with Geospatial Data: A world of possibilities” on October 5, at 10am PT. Register here.