Planet MySQL

Planet MySQL - https://planet.mysql.com
  1. Database administrators are responsible for maintaining the privacy and integrity of data. When the data contains confidential information, your company has a legal obligation to ensure that privacy is maintained. Even so, being able to access the information contained in that dataset, for example for testing or reporting purposes, has great value so what to do? MySQL Enterprise Edition offers data masking and de-identification, so I decided to contribute similar functionality to Percona Server for MySQL. In this post, I provide some background context and information on how to use these new functions in practice. Some context One of the most important assets of any company is data. Having good data allows engineers to build better systems and user experiences. Even through our most trivial activities, we continuously generate and share great volumes of data. I’m walking down the street and if I take a look at my phone it’s quite straightforward to get recommendations for a place to have lunch. The platform knows that it’s almost lunch time and that I have visited this nearby restaurant, or a similar one, a few times in the past. Sounds cool, right? But this process could be more manual than we might think at first. Even if the system has implemented things like AI or Machine Learning, a human will have validated the results; they might have taken a peek to ensure that everything is fine; or perhaps they are developing some new cool feature that must be tested… And this means that someone, somewhere has the ability to access my data. Or your data. Now, that is not so great, is it? In the last decade or so, governments around the world have taken this challenge quite seriously. They have enforced a series of rules to guarantee that the data is not only safely stored, but also safely used. I’m sure you will have heard terms like PCI, GDPR or HIPAA. They contain mandatory guidelines for how our data can be used, for primary or secondary purposes, and if it can be used at all. Data masking and de-identification One of the most basic safeguarding rules is that if the data is to be used for secondary purposes – such as for data analytics – it has to be de-identified in a way that it would make impossible identify the original individual. Let’s say that the company ACME is storing employee data. We will use the example database of employees that’s freely available. Employee number First name Last name Birth date Gender Hire date Gross salary Salary from date Salary to date We can clearly see that all those fields can be classified as private information. Some of these directly identify the original individual, like employee number or first + last name. Others could be used for indirect identification: I could ask my co-workers their birthday and guess the owner of that data using birth date. So, here is where de-identification and data-masking come into play. But what are the differences? De-identification transforms the original data into something different that could look more or less real. For example, I could de-identify birth date and get a different date. However, this method would make that information unusable if I want to see the relationship between salary and employee’s age. On the other hand, data-masking transforms the original data leaving some part untouched. I could mask birth date replacing the month and day for January first. That way, the year would be retained and that would allow us to identify that salary–employee’s age relationship. Of course, if the dataset I’m working with is not big enough, certain methods of data-masking would be inappropriate as I could still deduce who the data belonged to. MySQL data masking Oracle’s MySQL Enterprise Edition offers a de-identification and data-masking solution for MySQL, using a flexible set of functions that cover most of our needs. Percona Server for MySQL 8.0.17 introduces that functionality as an open source plugin, and is compatible with Oracle’s implementation. You no longer need to code slow and complicated stored procedures to achieve data masking, and you can migrate the processes that were written for the MySQL Enterprise Edition to Percona Server for MySQL. Go grab a cup of coffee and contribute something cool to the community with all that time you have got back. ☺ In the lab Put on your thinking cap and let’s see how it works. First we need an instance of Percona MySQL Server 8.0.17 or newer. I think containers are the most flexible way to test new stuff so I will be using that, but you could use a virtual server or just a traditional setup. Let’s download the latest version of Percona MySQL Server in a ready to run container: docker pull percona:8.0.17-1 Eventually that command should work but sadly, Percona hadn’t built this version of the docker image when this article was written. Doing it yourself is quite simple, though, and by the time you read this it will likely be already there. Once in place, Running an instance of Percona MySQL Server has never been so easy: docker run --name ps -e MYSQL_ROOT_PASSWORD=secret -d percona:8.0.17-8 We’ll logon to the new container: docker exec -ti ps mysql -u root -p Now is the time to download the test database employees from GitHub and load it into our Percona Server. You can follow the official instructions in the project page. Next step is to enable the data de-identification and masking feature. Installing the data masking module in Percona MySQL Server is easier than in Oracle. mysql> INSTALL PLUGIN data_masking SONAME 'data_masking.so'; Query OK, 0 rows affected (0.06 sec) This automatically defines a set of global functions in our MySQL instance, so we don’t need to do anything else. A new concept: Dictionaries Sometimes we will like to generate new data selecting values from a predefined collection. For example we could want to have first name  values that are really first names and not a random alphanumeric. This will make our masked data looks real, and it’s perfect for creating demo or QA environments. For this task we have dictionaries. They are nothing more than text files containing a value per line that are loaded into MySQL memory. You need to be aware that the contents of the file are fully loaded into memory and that the dictionary only exists while MySQL is running. So keep this in mind before loading any huge file or after restarting the instance. For our lab we will load two dictionaries holding first and last names. You can use these files or create different ones: first names and last names Store the files in a folder of your database server (or container) readable by the mysqld  process.wget https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/first_names.all.txt docker cp first_names.all.txt ps:/tmp/ wget https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/last_names.all.txt docker cp last_names.all.txt ps:/tmp/ Once the files are in our server we can map them as MySQL dictionaries. mysql> select gen_dictionary_load('/tmp/first_names.all.txt', 'first_names'); +----------------------------------------------------------------+ | gen_dictionary_load('/tmp/first_names.all.txt', 'first_names') | +----------------------------------------------------------------+ | Dictionary load success                                        | +----------------------------------------------------------------+ 1 row in set (0.04 sec) mysql> select gen_dictionary_load('/tmp/last_names.all.txt', 'last_names'); +--------------------------------------------------------------+ | gen_dictionary_load('/tmp/last_names.all.txt', 'last_names') | +--------------------------------------------------------------+ | Dictionary load success                                      | +--------------------------------------------------------------+ 1 row in set (0.03 sec) Masking some data Now let’s take another look at our employees tablemysql> show columns from employees; +------------+---------------+------+-----+---------+-------+ | Field      | Type   | Null | Key | Default | Extra | +------------+---------------+------+-----+---------+-------+ | emp_no     | int(11)   | NO | PRI | NULL    | | | birth_date | date          | NO | | NULL   | | | first_name | varchar(14)   | NO | | NULL   | | | last_name  | varchar(16)   | NO | | NULL   | | | gender     | enum('M','F') | NO   | | NULL   | | | hire_date  | date   | NO | | NULL    | | +------------+---------------+------+-----+---------+-------+ Ok, it’s very likely we will want to de-identify everything in this table. You can apply different methods to achieve your security requirements, but I will create a view with the following transformations: emp_no: get a random value from 900.000.000 to 999.999.999 birth_date: set it to January 1st of the original year first_name: set a random first name from a list of names that we have in a text file last_name: set a random last name from a list of names that we have in a text file gender: no transformation hire_date: set it to January 1st of the original year CREATE VIEW deidentified_employees AS SELECT   gen_range(900000000, 999999999) as emp_no,   makedate(year(birth_date), 1) as birth_date,   gen_dictionary('first_names') as first_name,   gen_dictionary('last_names') as last_name,   gender,   makedate(year(hire_date), 1) as hire_date FROM employees; Let’s check how the data looks in our de-identified view. mysql> SELECT * FROM employees LIMIT 10; +--------+------------+------------+-----------+--------+------------+ | emp_no | birth_date | first_name | last_name | gender | hire_date  | +--------+------------+------------+-----------+--------+------------+ |  10001 | 1953-09-02 | Georgi     | Facello | M | 1986-06-26 | |  10002 | 1964-06-02 | Bezalel    | Simmel | F | 1985-11-21 | |  10003 | 1959-12-03 | Parto      | Bamford | M | 1986-08-28 | |  10004 | 1954-05-01 | Chirstian  | Koblick | M | 1986-12-01 | |  10005 | 1955-01-21 | Kyoichi    | Maliniak | M | 1989-09-12 | |  10006 | 1953-04-20 | Anneke     | Preusig | F | 1989-06-02 | |  10007 | 1957-05-23 | Tzvetan    | Zielinski | F | 1989-02-10 | |  10008 | 1958-02-19 | Saniya     | Kalloufi | M | 1994-09-15 | |  10009 | 1952-04-19 | Sumant     | Peac | F | 1985-02-18 | |  10010 | 1963-06-01 | Duangkaew  | Piveteau | F | 1989-08-24 | +--------+------------+------------+-----------+--------+------------+ 10 rows in set (0.00 sec) mysql> SELECT * FROM deidentified_employees LIMIT 10; +-----------+------------+------------+---------------+--------+------------+ | emp_no    | birth_date | first_name | last_name     | gender | hire_date | +-----------+------------+------------+---------------+--------+------------+ | 930277580 | 1953-01-01 | skaidrīte  | molash | M | 1986-01-01 | | 999241458 | 1964-01-01 | grasen     | cessna | F | 1985-01-01 | | 951699030 | 1959-01-01 | imelda     | josephpauline | M | 1986-01-01 | | 985905688 | 1954-01-01 | dunc       | burkhardt | M | 1986-01-01 | | 923987335 | 1955-01-01 | karel      | wanamaker | M | 1989-01-01 | | 917751275 | 1953-01-01 | mikrut     | allee | F | 1989-01-01 | | 992344830 | 1957-01-01 | troyvon    | muma | F | 1989-01-01 | | 980277046 | 1958-01-01 | aliziah    | tiwnkal | M | 1994-01-01 | | 964622691 | 1952-01-01 | dominiq    | legnon | F | 1985-01-01 | | 948247243 | 1963-01-01 | sedale     | tunby | F | 1989-01-01 | +-----------+------------+------------+---------------+--------+------------+ 10 rows in set (0.01 sec) The data looks quite different, but remains good enough to apply some analytics and get meaningful results. Let’s de-identify the table salaries  this time.mysql> show columns from salaries; +-----------+---------+------+-----+---------+-------+ | Field     | Type | Null | Key | Default | Extra | +-----------+---------+------+-----+---------+-------+ | emp_no    | int(11) | NO   | PRI | NULL |       | | salary    | int(11) | NO   | | NULL |       | | from_date | date    | NO | PRI | NULL |       | | to_date   | date | NO   | | NULL |       | +-----------+---------+------+-----+---------+-------+ We could use something like this: CREATE VIEW deidentified_salaries AS SELECT gen_range(900000000, 999999999) as emp_no, gen_range(40000, 80000) as salary, mask_inner(date_format(from_date, '%Y-%m-%d'), 4, 0) as from_date, mask_outer(date_format(to_date, '%Y-%m-%d'), 4, 2, '0') as to_date FROM salaries; We are using again the function gen_range . For the dates this time we are using the very flexible functions mask_inner  and mask_outer  that replace some characters in the original string. Let’s see how the data looks now. In a real life exercise we would like to have the same values for emp_no across all the tables to keep referential integrity. This is where I think the original MySQL data-masking plugin falls short, as we don’t have deterministic functions using the original value as seed. mysql> SELECT * FROM salaries LIMIT 10; +--------+--------+------------+------------+ | emp_no | salary | from_date  | to_date | +--------+--------+------------+------------+ |  10001 |  60117 | 1986-06-26 | 1987-06-26 | |  10001 |  62102 | 1987-06-26 | 1988-06-25 | |  10001 |  66074 | 1988-06-25 | 1989-06-25 | |  10001 |  66596 | 1989-06-25 | 1990-06-25 | |  10001 |  66961 | 1990-06-25 | 1991-06-25 | |  10001 |  71046 | 1991-06-25 | 1992-06-24 | |  10001 |  74333 | 1992-06-24 | 1993-06-24 | |  10001 |  75286 | 1993-06-24 | 1994-06-24 | |  10001 |  75994 | 1994-06-24 | 1995-06-24 | |  10001 |  76884 | 1995-06-24 | 1996-06-23 | +--------+--------+------------+------------+ 10 rows in set (0.00 sec) mysql> SELECT * FROM deidentified_salaries LIMIT 10; +-----------+--------+------------+------------+ | emp_no    | salary | from_date  | to_date | +-----------+--------+------------+------------+ | 929824695 | 61543  | 1986XXXXXX | 0000-06-00 | | 954275265 | 63138  | 1987XXXXXX | 0000-06-00 | | 948145700 | 53448  | 1988XXXXXX | 0000-06-00 | | 937927997 | 54704  | 1989XXXXXX | 0000-06-00 | | 978459605 | 78179  | 1990XXXXXX | 0000-06-00 | | 993464164 | 75526  | 1991XXXXXX | 0000-06-00 | | 946692434 | 51788  | 1992XXXXXX | 0000-06-00 | | 979870243 | 54807  | 1993XXXXXX | 0000-06-00 | | 958708118 | 70647  | 1994XXXXXX | 0000-06-00 | | 945701146 | 76056  | 1995XXXXXX | 0000-06-00 | +-----------+--------+------------+------------+ 10 rows in set (0.00 sec) Clean-up Remember that when you’re done, you can free up memory by removing the dictionaries. Restarting the instance will also remove the dictionaries. mysql> SELECT gen_dictionary_drop('first_names'); +------------------------------------+ | gen_dictionary_drop('first_names') | +------------------------------------+ | Dictionary removed                 | +------------------------------------+ 1 row in set (0.01 sec) mysql> SELECT gen_dictionary_drop('last_names'); +------------------------------------+ | gen_dictionary_drop('last_names') | +------------------------------------+ | Dictionary removed                 | +------------------------------------+ 1 row in set (0.01 sec) If you use the MySQL data-masking plugin to define different levels of access to the data, remember that you will need to load the dictionaries each time the instance is restarted. With this usage, for example, you could control the data that someone in support has access to, very much like a bargain-basement virtual private database solution. (I’m not proposing this for production systems!) Other de-identification and masking functions Percona Server for MySQL Data-Masking includes more functions that the ones we’ve seen here. We have specialized functions for Primary Account Numbers (PAN), Social Security Numbers (SSN), phone numbers, e-Mail addresses… And also generic functions that will allow us to de-identify types without a specialized method. Being an open source plugin it should be quite easy to implement any additional methods and contribute it to the broader community. Next Steps Using these functions we can de-identify and mask any existing dataset. But if you are populating a lower level environment using production data you would want to store the transformed data only. To achieve this you could choose between various options. Small volumes of data: use “de-identified” views to export the data and load into a new database using mysqldump or mysqlpump. Medium volumes of data: Clone the original database and de-identify locally the data using updates. Large volumes of data option one: using replication, create a master -> slave chain with STATEMENT binlog format and define triggers de-identifying the data on the slave. Your master can be a slave to the master (using log_slave_updates), so you don’t need to run your primary master in STATEMENT mode. Large volumes of data option two: using multiplexing in ProxySQL, configure ProxySQL to send writes to a clone server where you have defined triggers to de-identify the data. Future developments While de-identifying complex schemas we could find that, for example, the name of a person is stored in multiple tables (de-normalized tables). In this case, these functions would generate different names and the resulting data will look broken. You can solve this using a variant of the dictionary functions that will obtain the value based on the original value and passed as parameter: gen_dictionary_deterministic('Francisco', 'first_names') This not-yet-available function would always return the same value using that dictionary file, but in such a way that the de-identification cannot be reversed. Oracle doesn’t currently support this, so we will expand Percona Data-Masking plugin to introduce this as a unique feature. However, that will be in another contribution, so stay tuned for more exciting changes to Percona Server for MySQL Data Masking. —Image: Photo by Finan Akbar on Unsplash The content in this blog is provided in good faith by members of the open source community. Percona has not edited or tested the technical content (although in this case, of course, we have tested the data masking feature incorporated into Percona Server for MySQL 8.0/17, just not the examples in this blog). Views expressed are the authors’ own. When using the advice from this or any other online resource test ideas before applying them to your production systems, and always secure a working back up. The post Percona Server for MySQL 8.0 – New Data Masking Feature appeared first on Percona Community Blog.
  2. MySQL and PHP is a love story that started long time ago. However the love story with MySQL 8.0 was a bit slower to start… but don’t worry it rules now ! The support of MySQL 8.0’s new default authentication method in PHP took some time and was added in PHP 7.2.8 but removed in PHP 7.2.11. Now it’s fully supported in PHP 7.4 ! If you have installed PHP 7.4, you can see that the new plugin auth_plugin_caching_sha2_passwordis now available: # php -i | grep "Loaded plugins|PHP Version " | tail -n2 PHP Warning: Module 'mysql_xdevapi' already loaded in Unknown on line 0 PHP Version => 7.4.0 Loaded plugins => mysqlnd,debug_trace,auth_plugin_mysql_native_password, auth_plugin_mysql_clear_password, auth_plugin_caching_sha2_password, auth_plugin_sha256_password So no need to create a user with mysql_native_passwordas authentication method in MySQL 8.0 as explained in the following posts: https://lefred.be/content/mysql-8-0-17-and-drupal-8-7/ https://lefred.be/content/migrating-to-mysql-8-0-for-wordpress-episode-1/ https://lefred.be/content/drupal-and-mysql-8-0-11-are-we-there-yet/ https://lefred.be/content/joomla-and-mysql-8-0-12/ https://lefred.be/content/mysql-8-0-and-magento/ In summary, if you want to use a more secure method to connect to your MySQL 8.0 form your PHP application, make sure you upgrade to PHP 7.4
  3. Introduction In this article, we are going to see how a CROSS JOIN works, and we will also make use of this SQL join type to build a poker card game. Database table model For our poker card game application, we have created the ranks and suits database tables: The ranks table defines the ranking of cards, as well as the name and symbol used for each card rank: The suits table describes the four possible categories used by the French playing cards: Cartesian product In the set theory, the Cartesian product... Read More The post A beginner’s guide to SQL CROSS JOIN appeared first on Vlad Mihalcea.
  4. In my road trip of the Open Source projects using MySQL, after having tested WordPress, Drupal and Joomla, let’s try to install Magento using MySQL 8.0 ! In Magento’s manual, we can see that the project requires MySQL 5.6 and supports 5.7.x since versoin 2.1.2. In my test, I will use Magento 2.3.3, the latest stable when writing this article. The manual stipulates that we should use ROW based replication but not GTID because Magento 2 is using CREATE TEMPORARY TABLE inside transactions. In fact, this limitation doesn’t exist anymore since MySQL 8.0.13. From MySQL 8.0.13, when binlog_format is set to ROW or MIXED, CREATE TEMPORARY TABLE and DROP TEMPORARY TABLE statements are allowed inside a transaction, procedure, function, or trigger when GTIDs are in use. The statements are not written to the binary log and are therefore not replicated to slaves. The use of row-based replication means that the slaves remain in sync without the need to replicate temporary tables. If the removal of these statements from a transaction results in an empty transaction, the transaction is not written to the binary log. See the MySQL Manual. So this is not a limitation anymore. And for this installation I will of course use the latest version of MySQL 8.0 available right now, MySQL 8.0.18. Setup the database So MySQL 8.0.18 is installed and now we need to create the database/schema and setup the credentials to install Magento: mysql> CREATE DATABASE magento2; mysql> CREATE USER magento2 IDENTIFIED BY 'magento2'; mysql> GRANT ALL ON magento2.* TO magento2; We can see that we are using MySQL 8.0 and the default authenticatio plugin: mysql> SELECT Host, User, plugin, @@version FROM mysql.user WHERE user='magento2'; +------+----------+-----------------------+-----------+ | Host | User | plugin | @@version | +------+----------+-----------------------+-----------+ | % | magento2 | caching_sha2_password | 8.0.18 | +------+----------+-----------------------+-----------+ As in the future we will use any kind of replication, we will also enable GTID: mysql> SET persist enforce_gtid_consistency=on; mysql> SET persist_only gtid_mode=on; mysql> RESTART; Magento Installation We can now start the installation wizard of Magento: The first checks are not related to MySQL and all necessary packages are installed. FYI, this is Oracle Linux 8: Now it’s time to insert the MySQL information: And when we press on Next, we have our first small issue: This is exactly what the error message is telling. The new authentication plugin is now supported. We can also see that in the php error log: [11-Dec-2019 20:30:49 UTC] PHP Fatal error: Uncaught PDOException: PDO::__construct(): The server requested authentication method unknown to the client [caching_sha2_password] in /var/www/html/vendor/magento/zendframework1/library/Zend/Db/Adapter/Pdo/Abstract.php:128 The PHP version is 7.2.11 which doesn’t support MySQL 8’s new default secure authentication plugin: caching_sha2_password. It was supported in 7.2.8 but removed on 7.2.11, for more info check: https://lefred.be/content/php-7-2-8-mysql-8-0/ https://lefred.be/content/mysql-8-0-17-and-drupal-8-7/ So now, we just need to modify the authentication method of our user: mysql> ALTER USER magento2 IDENTIFIED WITH 'mysql_native_password' BY 'magento2'; And we can press Next again in Magento’s installation wizard and fill all the next Steps until Step 6: Let’s go an press Install Now ! The installation process seems to hang at 58%… and when we check the Console Log we can see the following: We followed the manual, but it seems our user needs more privileges. But let’s try to not provide that and try the second option: mysql> SET PERSIST log_bin_trust_function_creators=1; We can refresh the page and click again on Install Now. The process continues to 91% and then another strange error: I checked the status of the table and nothing seems wrong, so I refreshed the table and restart the process that resumes and goes to the end successfully: And of course, we can now access the Site we just deployed: Summary In summary, Magento works fine with MySQL 8.0, you have very few changes to perform, they are related to the user Magento that connect to MySQL and one global variable: mysql> CREATE USER magento2 IDENTIFIED WITH 'mysql_native_password' BY 'magento2'; mysql> GRANT ALL ON magento2.* TO magento2; mysql> SET PERSIST log_bin_trust_function_creators=1; And of course, there is no more reason to not use GTIDs !
  5. To avoid infinite replication loops MySQL doesn’t allow you to have log_slave_updates and replicate-same-server-id. When using GTIDs that may lead to something not expected that you may not be aware of. In this scenario, we have 2 MySQL servers using GTID. The sever uuid part of the GTID has been modified in the illustration to make it more clear. Both servers have log_slave_updates enabled too: So far nothing unusual. So let’s write data on the master (MySQL A): We can see that this first transaction is identified by its GTID where the uuid matches MySQL A and the sequence number is 1. Let’s write some data again: All good. Now let’s take a backup on the replica (MySQL B): Backup is consistent and matches the data on both servers. Now let’s write again some data: Of course the backup that was taken earlier does not change. All suddenly, MySQL A crashes and goes away… We promote MySQL B as new writer and we use it to write again some data: We can notice that the GTID changed to use the uuid of MySQL B. (This information is contained in the variable gtid_executed). It’s time to restore our backup on MySQL A: And we configure MySQL A to become replica of MySQL B: Wow ! All the transactions happened after the backup on MySQL A have been ignored ! In fact, this is again to protect our user to have problems like circular replication and infinite loop. To be able to replicate the missing transaction is to change the server_id to a new unique value after the restore and before starting replication. Then replication will work as expected: In summary, if you have enabled log_slave_updates and you want to recreate a master from a backup taken on a replica, you must change the server_id. Even if you use GTIDs, server_id is for the moment still to be taken in consideration. If you plan to write back on MySQL A, it’s always safer to also change its server_uuid to avoid any split-brain situation.