Issue:Modifiedby and createdby produce inefficient SQL

From FollowTheScore
Revision as of 20:39, 7 August 2013 by Ftiercel (talk | contribs) (Implementation.)
Jump to: navigation, search
Description: modifiedby and createdby produce inefficient SQL
Extension / Version: DPL   /   2.0
Type / Status: Bug   /   open

Problem

This ticket is about inefficient SQL queries in DPL that take very long to run. I provide some new queries that are up to 1000x faster.

modifiedby

The following DPL to list articles modified by a user takes over 10 seconds on our wiki:

modifiedby=Smith
namespace=

Its query produces the following SQL (obtained using "debug=6"):

SELECT DISTINCT
 `vpw_page`.page_namespace as page_namespace
,`vpw_page`.page_title as page_title
,`vpw_page`.page_id as page_id
,`vpw_page`.page_title as sortkey
FROM
 `vpw_page`
WHERE
 1=1
 AND `vpw_page`.page_namespace IN ('0')
 AND `vpw_page`.page_is_redirect=0
 AND 'Smith' in (
  select rev_user_text
  from `vpw_revision`
  where `vpw_revision`.rev_page=page_id)
ORDER BY page_title ASC
LIMIT 0, 1000

Here is a better query for modified articles that runs in 0.02 seconds:

SELECT DISTINCT
 `vpw_page`.page_namespace as page_namespace
,`vpw_page`.page_title as page_title
,`vpw_page`.page_id as page_id
,`vpw_page`.page_title as sortkey
FROM
 `vpw_page`
 INNER JOIN `vpw_revision` ON (`vpw_page`.page_id = `vpw_revision`.rev_page)
WHERE
 1=1
 AND `vpw_page`.page_namespace IN ('0')
 AND `vpw_page`.page_is_redirect=0
 AND `vpw_revision`.rev_user_text = 'Smith'
ORDER BY page_title ASC
LIMIT 0, 1000

createdby

Here is an even worse DPL query to get created articles by a user. It takes 2 minutes and 40 seconds to run on our wiki:

<dpl>
createdby=Smith
namespace=
</dpl>

And here is better SQL that runs in 0.05 seconds and produces the same results:

SELECT
 `vpw_page`.page_namespace as page_namespace
 ,`vpw_page`.page_title as page_title
 ,`vpw_page`.page_id as page_id
 , `vpw_page`.page_title as sortkey
FROM
 `vpw_page`
 INNER JOIN `vpw_revision` ON (`vpw_page`.page_id = `vpw_revision`.rev_page)
WHERE
 1=1
 AND `vpw_page`.page_namespace IN ('0')
 AND `vpw_page`.page_is_redirect=0
 AND `vpw_revision`.rev_user_text = 'Smith'
 AND `vpw_revision`.rev_parent_id = 0
ORDER BY page_title ASC
LIMIT 0, 1000

Reply

I have looked at the code and I think it can be implemented that way:

In DPLMain.php at line 2142-2156 remove this:

- if ( $sCreatedBy != "" ) {
- $sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sCreatedBy) . ' = (SELECT rev_user_text FROM '.$sRevisionTable
- .' WHERE '.$sRevisionTable.'.rev_page=page_id ORDER BY '.$sRevisionTable.'.rev_timestamp ASC LIMIT 1)';
- }
- if ( $sNotCreatedBy != "" ) {
- $sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sNotCreatedBy) . ' != (SELECT rev_user_text FROM '.$sRevisionTable
- .' WHERE '.$sRevisionTable.'.rev_page=page_id ORDER BY '.$sRevisionTable.'.rev_timestamp ASC LIMIT 1)';
- }
- if ( $sModifiedBy != "" ) {
- $sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sModifiedBy) . ' IN (SELECT rev_user_text FROM '.$sRevisionTable
- .' WHERE '.$sRevisionTable.'.rev_page=page_id)';
- }
- if ( $sNotModifiedBy != "" ) { - $sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sNotModifiedBy) . ' NOT IN (SELECT rev_user_text FROM '.$sRevisionTable.' WHERE '.$sRevisionTable.'.rev_page=page_id)';
- }

...at line 2246 remove this:

- ' FROM ' . $sSqlRevisionTable . $sSqlRCTable . $sSqlPageLinksTable . $sSqlExternalLinksTable . $sPageTable;


Now in DPLMain.php at line 1783 add this:

$sSqlCreationRevisionTable = '';
$sSqlNoCreationRevisionTable = '';
$sSqlChangeRevisionTable = '';

...at line 2142-2156 add this:

if ( $sCreatedBy != "" ) {
$sSqlCreationRevisionTable = $sRevisionTable . ' AS creation_rev, ';
$sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sCreatedBy) . ' = creation_rev.rev_user_text'
.' AND creation_rev.rev_page = page_id'
.' AND creation_rev.rev_parent_id = 0';
}
if ( $sNotCreatedBy != "" ) {
$sSqlNoCreationRevisionTable = $sRevisionTable . ' AS no_creation_rev, ';
$sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sCreatedBy) . ' != no_creation_rev.rev_user_text'
.' AND no_creation_rev.rev_page = page_id'
.' AND no_creation_rev.rev_parent_id = 0';
}
if ( $sModifiedBy != "" ) {
$sSqlChangeRevisionTable = $sRevisionTable . ' AS change_rev, ';
$sSqlCond_page_rev .= ' AND ' . $dbr->addQuotes($sCreatedBy) . ' = change_rev.rev_user_text'
.' AND change_rev.rev_page = page_id';
}
if ( $sNotModifiedBy != "" ) {
$sSqlCond_page_rev .= ' AND NOT EXISTS (SELECT 1 FROM '.$sRevisionTable.' WHERE '.$sRevisionTable.'.rev_page=page_id AND '.$sRevisionTable.'.rev_user_text = ' . $dbr->addQuotes($sNotModifiedBy) . ' LIMIT 1)';
}

...at line 2246 add this:

' FROM ' . $sSqlRevisionTable . $sSqlCreationRevisionTable . $sSqlNoCreationRevisionTable . $sSqlChangeRevisionTable . $sSqlRCTable . $sSqlPageLinksTable . $sSqlExternalLinksTable . $sPageTable;

Instead of creating subquery, it only makes joins except for NotModifiedBy, but this query retrieves less data from the database. Unfortunately, I can't test this code, so you would have to experiment this.

You may think that it makes the code even more complex. Don't worry. I have also a solution to make your code simpler, safer, cleaner and not slower. Currently, when you are exploring all the active options (createdby, category...), you are building the query at the same time. Another solution is to do this in two steps:

  1. Before processing, create arrays. One for the SELECT clause, one for the FROM clause, one for the WHERE clause and one for the ORDER BY clause,
  2. When you process all the options, fill the arrays in any order. You can complete the FROM array and the WHERE array at the same time,
  3. Once you have stored all the useful information in all the arrays, create a function that visits the SELECT array to format the SELECT clause, visit the FROM array to format the FROM clause and so on.
  • The SELECT array may contain a list of strings and each string is a field,
  • The FROM array may contain a list of pair of strings. The first one is a name of table and the second is the alias,
  • The WHERE array may contain a list of strings and each string is a condition. A better version could handle the AND and OR operators,
  • The ORDER BY array may contain a list of strings and each string is a field.

Doing this, it would be easier to debug as you can call the format function at any time to print the query. You can change the code step by step if you want, by implementing just the arrays one by one. It is an archaic version of the builder design pattern. Ftiercel (talk) 21:39, 7 August 2013 (CEST)