Analysing Million+ Tweets Using the JSON Features of Oracle Database 12c

In this blog post, I will share what I have learned about the JSON processing capabilities of the Oracle database. My dataset contains 1.5 million tweets collected over 2 days for a given set of tokens (more about the tokens later in this post). IMO it represents a good-sized real-world dataset for experiments, rather than a table with just a few rows of mocked JSON data. The dataset was collected using the Twitter Streaming API.
 
The collection script is part of my Social-Data-Aggregrator project, which is being developed using Big Data technologies and Python. The script is quite versatile and configurable: it connects to the Twitter stream using the Tweepy Python package, filters tweets containing the ‘interested’ tokens, and writes them to a Kafka topic. The collected tweets are then pushed from Kafka to a local filesystem by the dataWriter.py script, and from there loaded into an Oracle 12c database.

 

Database size and version

I chose a 12cR2 Database Cloud Service on Oracle Cloud for this analysis. It has 2 OCPUs (2 Intel cores) and 30 GB of memory, of which 11 GB has been allocated to the SGA.

 

Data Source 

As mentioned above, the source of the data is the Twitter stream, filtered for the following tokens:
 
Shoe brands – Nike, Reebok, Adidas
Banks – UBS, Barclays, HSBC
Technology brands – Google, IBM, Microsoft, Oracle, MongoDB, Datastax
 

Database Objects Setup and Data Loading

We need an external table mapped to the file containing the tweets, and a permanent table into which the data from the external table is then loaded. The permanent table is optional, as the same SQL queries can be executed directly against the external table.
 
CREATE DIRECTORY twitter_data as '/u01/app/oracle/tweets';

CREATE TABLE tweet_external_table (tweet clob)
ORGANIZATION EXTERNAL (TYPE ORACLE_LOADER DEFAULT DIRECTORY twitter_data
ACCESS PARAMETERS
(RECORDS DELIMITED BY 0x'0A'
DISABLE_DIRECTORY_LINK_CHECK
FIELDS (tweet CHAR(32000)))
LOCATION (twitter_data:'twitter_data.json'))
PARALLEL
REJECT LIMIT UNLIMITED;

CREATE TABLE tweet_table (
id    VARCHAR2(32)    NOT NULL PRIMARY KEY,
date_loaded    TIMESTAMP(6) WITH TIME ZONE,
tweet    CLOB
CONSTRAINT ensure_json CHECK (tweet IS JSON))
LOB (tweet) STORE AS (CACHE);

Note the check constraint in the table definition above. It allows us to filter out any malformed or incomplete data points.
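
A quick way to see the constraint in action is to try inserting a document that is not valid JSON (a hypothetical example; the schema name in the error will be your own):

INSERT INTO tweet_table (id, date_loaded, tweet)
VALUES (SYS_GUID(), SYSTIMESTAMP, 'this is not JSON');
-- ORA-02290: check constraint (<schema>.ENSURE_JSON) violated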

Loading Data into the Oracle Database Table

INSERT INTO tweet_table (id, date_loaded, tweet)
SELECT SYS_GUID(), SYSTIMESTAMP, tweet FROM tweet_external_table
WHERE tweet IS JSON;
COMMIT;

Check the number of rows in tweet_table to confirm that I have 1.5 million tweets:
SELECT count(*) from tweet_table;
COUNT(*)
----------
1550931

Analysing JSON Documents using ‘dot’ Notation and SQL/JSON Expressions

Now that I have the twitter data in the database, I will run the following three basic analyses on this dataset, first using ‘dot’ notation and then using the JSON_VALUE and JSON_QUERY SQL/JSON expressions.

Analysis #1 – Distribution of tweets per language

Note that the column where the JSON document is stored is called ‘tweet’. Using ‘dot’ notation in SQL, you can traverse the JSON document easily. I think this is quite powerful for people familiar with SQL; the equivalent in another programming language could easily be a few lines of code.
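The dot-notation query behind the chart below is along these lines (a sketch; note that simple dot notation requires a table alias):

SELECT lang, count(*) NumTweets FROM (
SELECT t.tweet.lang lang FROM tweet_table t )
GROUP BY lang;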
[Figure: Distribution of tweets by language]
And the same query can be written using SQL/JSON expression as shown below:
SELECT lang, count(*) NumTweets FROM (
SELECT json_value(tweet, '$.lang') lang FROM tweet_table )
GROUP BY lang;

Analysis #2 – Distribution of tweets per country

A tweet object contains location information only if the tweet is geo-tagged. In my dataset I noticed that a large number of tweets do not carry location data, i.e. the value of tweet.place.country is null. For this reason, I have plotted the distribution of tweets per country as a percentage of the total geo-tagged tweets.
 
[Figure: Distribution of tweets by country, as a percentage of geo-tagged tweets]
The corresponding query, using a SQL/JSON expression, is shown below:
SELECT country, count(*) NumTweets FROM (
SELECT json_value(tweet, '$.place.country') country FROM tweet_table )
GROUP BY country;
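
The query above returns raw counts, including a null group for the tweets without location data. A sketch of how the percentages could be computed over geo-tagged tweets only, using the RATIO_TO_REPORT analytic function:

SELECT country, count(*) NumTweets,
round(ratio_to_report(count(*)) OVER () * 100, 2) PctTweets
FROM (SELECT json_value(tweet, '$.place.country') country FROM tweet_table)
WHERE country IS NOT NULL
GROUP BY country
ORDER BY NumTweets DESC;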

Analysis #3 – Distribution of tweets for each collected token

Finally, I check the distribution of tweets across all the tokens collected over the 2-day period. The CASE statement in the SQL is used for categorisation based on the tweet text; a sketch follows. I have over 700K tweets tagged ‘unknown’, i.e. containing none of the tokens my data collection script intended to collect. I need to check why these tweets were collected, but that is outside the scope of this blog post.
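The full query is not reproduced here; a minimal sketch of the categorisation, assuming it is based on the tweet text field ($.text) and abridged to three of the tokens:

SELECT token, count(*) NumTweets FROM (
SELECT CASE
WHEN lower(json_value(tweet, '$.text')) LIKE '%nike%' THEN 'nike'
WHEN lower(json_value(tweet, '$.text')) LIKE '%oracle%' THEN 'oracle'
WHEN lower(json_value(tweet, '$.text')) LIKE '%barclays%' THEN 'barclays'
-- ... one WHEN branch per collected token ...
ELSE 'unknown'
END token
FROM tweet_table )
GROUP BY token;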
[Figure: Distribution of tweets by token]
  
Be aware
 
I noticed that dot notation does not work if one of the tags in the JSON document is an Oracle reserved word. For example, I could not access the “user” details from a tweet using ‘dot’ notation; the error is shown below. However, it worked just fine with the JSON_VALUE SQL/JSON expression.
select a.tweet.user from tweet_table a
*
ERROR at line 1:
ORA-01747: invalid user.table.column, table.column, or column specification
Using the JSON_VALUE expression:
 
SELECT json_value(tweet, '$.user.name') username,
json_value(tweet, '$.user.followers_count') followers_count
FROM tweet_table;
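
JSON_VALUE returns scalar values only. To extract a JSON fragment, such as the array of hashtags, JSON_QUERY can be used instead; a sketch, assuming the standard Tweet object structure:

SELECT json_query(tweet, '$.entities.hashtags[*].text' WITH WRAPPER) hashtags
FROM tweet_table;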
The purpose of this blog post is to highlight the storage and query capabilities of Oracle 12c for JSON data; it does not include any benchmarking or performance testing. There is a lot more to the JSON processing capabilities of the Oracle database than what I have written and demonstrated here. IMO, it is worth reading the documentation and exploring the possibilities.
 
Access to tweet data
 
I can share the dataset if you would like to use it for your own experiments. Its compressed size is 1.1 GB.
 
References:
The following links provide more information about JSON processing as well as about the Tweet object.