You Can’t Do Machine Learning Inside a Database. Can You?
Yes, you can. Java, Python, and R algorithms can be trained, tested and put into production inside proprietary or open source analytical databases.
That’s not right. You have to use Spark or something if you want to do sophisticated machine learning.
Analytical databases can’t do time series analysis, geospatial, or things like random forest, SVM, clustering or logistic regression.
Actually, you can do all of that in a database. Most analytics-focused databases, both proprietary and open source, now offer in-database machine learning, including BigQuery ML, Oracle,… Vertica also has windowing functions, pattern matching, and event series pattern matching.
Ok, maybe you can, but why would you?
- Fast Data Prep – Analytics databases have things like joining datasets, feature engineering, eliminating outliers, missing value imputation, etc. down pat.
- More Efficient Processing – Combining the computation engine and data storage management system eliminates the need to move data between a database and a statistical analysis tool. Moving your data around and transforming it costs CPU, IO, time, and money.
- Production Ready – You won’t have to redo all the work you did experimenting, training and testing, in some completely different technology to make it production ready. It already is.
- No Need to Downsample – Train and test models on the data right where it sits. No need to pull out something you hope is a representative sample to put it in some other technology you hope can handle the volume.
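To make the data prep point concrete, here is a minimal sketch of a join, missing value imputation, and outlier removal done in a single SQL statement. It uses SQLite from the Python standard library purely as a stand-in for an analytical database. The table names, columns, sample rows, and the 1000 kWh outlier cutoff are all made up for illustration.

```python
import sqlite3

# SQLite is only a stand-in here; an MPP analytical database would run
# the same kind of SQL in parallel across nodes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE readings  (customer_id INTEGER, usage_kwh REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO readings VALUES
        (1, 10.0), (1, 12.0), (1, NULL),
        (2, 20.0), (2, 9999.0),
        (3, 14.0), (3, NULL);
""")

# One statement does the join, imputes NULLs with the mean of the valid
# readings, and drops readings above a hypothetical cutoff of 1000.
prep_sql = """
    SELECT c.region,
           COUNT(*) AS n_rows,
           AVG(COALESCE(r.usage_kwh,
                        (SELECT AVG(usage_kwh)
                         FROM readings
                         WHERE usage_kwh < 1000))) AS avg_usage
    FROM readings r
    JOIN customers c ON c.id = r.customer_id
    WHERE r.usage_kwh IS NULL OR r.usage_kwh < 1000
    GROUP BY c.region
    ORDER BY c.region
"""
features = conn.execute(prep_sql).fetchall()

for region, n_rows, avg_usage in features:
    print(region, n_rows, round(avg_usage, 2))
```

Only the small, aggregated result leaves the database. The raw rows never move.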
In-database machine learning would be really difficult to do, though, right?
Nope. Vertica, for instance, has optimized parallel machine learning algorithms built in. All you have to do is call them in SQL, or you can use Python or Java APIs. User-defined extensions also let you build your own algorithms in Python, R, or Java and then call those new functions the same way.
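As a rough illustration of the pattern, the sketch below fits a one-variable linear regression from SQL aggregates and then registers a scoring function that can be called from SQL. This is not Vertica's API. SQLite's `create_function` stands in for the idea of a user-defined extension, and the tables, data, and the function name `predict_y` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE training (x REAL, y REAL);
    INSERT INTO training VALUES (1, 3), (2, 5), (3, 7), (4, 9);
    CREATE TABLE new_data (x REAL);
    INSERT INTO new_data VALUES (5), (10);
""")

# The database scans the training rows; only five aggregates come back.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM training"
).fetchone()

# Closed-form least-squares fit for y = slope * x + intercept.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n


def predict(x):
    """Score one value with the fitted line."""
    return slope * x + intercept


# Register the Python function so SQL can call it (hypothetical name).
conn.create_function("predict_y", 1, predict)

predictions = conn.execute(
    "SELECT x, predict_y(x) FROM new_data ORDER BY x"
).fetchall()
print(predictions)
```

Once the function is registered, scoring new rows is an ordinary query, so anything that can send SQL can use the model.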
But a database can’t handle Big Data. Huge data volumes can only be processed with something like Hadoop.
Modern MPP analytical databases are made to scale across multiple commodity servers, too, with no special hardware required. And databases brag about their high concurrency capabilities. Hundreds of users, or hundreds of visualization drilldown dashboards, can ping a good database with queries all at once without bogging it down. Try to get that from Hadoop.
But databases only deal with structured data. What if I need to do machine learning on semi-structured data or big data formats like ORC or Parquet?
Some databases can query tables outside their own storage format. Vertica, for instance, can query “external tables,” aka data stored in ORC or Parquet. It also has the concept of “flex tables,” aka semi-structured data with a schema-on-read strategy similar to Hive.
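For a feel of what schema-on-read means, here is a small sketch that stores raw JSON documents as text and only interprets their fields at query time. It is not how Vertica implements flex tables or external tables, and it does not read ORC or Parquet. SQLite plus a hypothetical helper function, `json_field`, simply illustrates the idea, and the events are invented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# No schema for the payload: each row is just a raw JSON document.
conn.execute("CREATE TABLE raw_events (doc TEXT)")

events = [
    {"device": "a1", "temp_c": 21.5},
    {"device": "b2", "temp_c": 30.0, "firmware": "1.2"},
    {"device": "a1"},  # temp_c is missing in this document
]
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(json.dumps(e),) for e in events],
)


def json_field(doc, key):
    """Return one top-level field, or None (SQL NULL) if it is absent."""
    return json.loads(doc).get(key)


# Hypothetical helper registered so SQL can apply a schema while reading.
conn.create_function("json_field", 2, json_field)

rows = conn.execute("""
    SELECT json_field(doc, 'device') AS device,
           AVG(json_field(doc, 'temp_c')) AS avg_temp
    FROM raw_events
    GROUP BY device
    ORDER BY device
""").fetchall()
print(rows)
```

Documents with extra or missing fields load without any changes to the table. The query decides which fields matter.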
Lots of companies have requirements to move to the Cloud now. Analytical databases don’t work in the Cloud, unless they’re specially built for it, like Amazon Redshift or Snowflake.
Lots of databases work fine in the Cloud. A database is software. If it’s made right, it should not be tied to any particular hardware or deployment option. You should decide, based on the needs of your business, whether to deploy on-premises, on a Cloud or multiple Clouds, or in some hybrid configuration. If your database only runs in the Cloud, or worse, only runs on one specific Cloud, that seriously limits your future options. Don’t do that to yourself.
Streaming data, though, like from IoT use cases. Databases can’t do constant parallel data loads from something like Kafka and still do machine learning. Ha. Got you that time.
You mean streaming IoT use cases like predictive maintenance, network optimization, cybersecurity and fraud prevention? The kind that 7 out of the 10 biggest telecom companies on the planet do, for instance? You know, the ones that use Vertica.
Um, yeah. Like that. So, if I can do all that inside a database, and get all these advantages, why the heck have I been moving my data out of the database to do machine learning?
Beats me. Databases are made to efficiently store, retrieve, manipulate and analyze data.
Let databases do what databases do best.
See for yourself. Try Vertica for FREE.