PHP语言下利用scws+sphinx+mysql打造千万级数据全文搜索
小标 2018-08-02 来源 : 阅读 1191 评论 0

摘要:本文主要向大家介绍了PHP语言下利用scws+sphinx+mysql打造千万级数据全文搜索,通过具体的内容向大家展示,希望对大家学习php语言有所帮助。

本文主要向大家介绍了PHP语言下利用scws+sphinx+mysql打造千万级数据全文搜索,通过具体的内容向大家展示,希望对大家学习php语言有所帮助。

Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL数据库和脚本语言集成。当前系统内置MySQL和PostgreSQL 数据库数据源的支持,也支持从标准输入读取特定格式 的XML数据。

Sphinx创建索引的速度为:创建100万条记录的索引只需3~4分钟,创建1000万条记录的索引可以在50分钟内完成,而只包含最新10万条记录的增量索引,重建一次只需几十秒。

Sphinx的特性如下:

a) 高速的建立索引(在当代CPU上,峰值性能可达到10 MB/秒);

b) 高性能的搜索(在2 – 4GB 的文本数据上,平均每次检索响应时间小于0.1秒);

c) 可处理海量数据(目前已知可以处理超过100 GB的文本数据, 在单一CPU的系统上可处理100 M 文档);

d) 提供了优秀的相关度算法,基于短语相似度和统计(BM25)的复合Ranking方法;

e) 支持分布式搜索;

f) 支持短语搜索

g) 提供文档摘要生成

h) 可作为MySQL的存储引擎提供搜索服务;

i) 支持布尔、短语、词语相似度等多种检索模式;

j) 文档支持多个全文检索字段(最大不超过32个);

k) 文档支持多个额外的属性信息(例如:分组信息,时间戳等);

l) 支持断词;

虽然mysql的MYISAM提供全文索引,但是性能却不敢让人恭维

开始搭建

系统环境:centos6.5+php5.6+apache+mysql

1、安装依赖包

[php]view plaincopy

yum-yinstallmakegccg++gcc-c++libtoolautoconfautomakeimakephp-develmysql-devellibxml2-develexpat-devel

2、安装Sphinx

[php]view plaincopy

yuminstallexpatexpat-devel

wget-chttps://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz

tarzxvfsphinx-2.0.7-release.tar.gz

cdsphinx-2.0.7-release

./configure--prefix=/usr/local/sphinx--with-mysql--with-libexpat--enable-id64

make&&makeinstall

3、安装libsphinxclient,php扩展用到

[php]view plaincopy

cdapi/libsphinxclient

./configure--prefix=/usr/local/sphinx/libsphinxclient

make&&makeinstall

4、安装Sphinx的PHP扩展:我的是5.6需装sphinx-1.3.3.tgz,如果是php5.4以下可sphinx-1.3.0.tgz

[php]view plaincopy

wget-chttps://pecl.php.net/get/sphinx-1.3.3.tgz

tarzxvfsphinx-1.3.3.tgz

cdsphinx-1.3.3

phpize

./configure--with-sphinx=/usr/local/sphinx/libsphinxclient/--with-php-config=/usr/bin/php-config

make&&makeinstall

成功后会提示:

Installingsharedextensions:/usr/lib64/php/modules/

echo"[Sphinx]">>/etc/php.ini

echo"extension=sphinx.so">>/etc/php.ini

#重启apache

servicehttpdrestart

5、创建测试数据

[php]view plaincopy

CREATETABLEIFNOTEXISTS`items`(

`id`int(11)NOTNULLAUTO_INCREMENT,

`title`varchar(255)NOTNULL,

`content`textNOTNULL,

`created`datetimeNOTNULL,

PRIMARYKEY(`id`)

)ENGINE=InnoDBDEFAULTCHARSET=utf8COMMENT='全文检索测试的数据表'AUTO_INCREMENT=11;

INSERTINTO`items`(`id`,`title`,`content`,`created`)VALUES

(1,'linuxmysql集群安装','MySQLCluster是MySQL适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本','2016-09-0700:00:00'),

(2,'mysql主从复制','mysql主从备份(复制)的基本原理mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器','2016-09-0600:00:00'),

(3,'hello','canyousearchme','2016-09-0500:00:00'),

(4,'mysql','mysqlisthebestdatabase','2016-09-0300:00:00'),

(5,'mysql索引','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','2016-09-0100:00:00'),

(6,'集群','关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车','0000-00-0000:00:00'),

(9,'复制原理','redis也有复制','0000-00-0000:00:00'),

(10,'redis集群','集群技术是构建高性能网站架构的重要手段,试想在网站承受高并发访问压力的同时,还需要从海量数据中查询出满足条件的数据,并快速响应,我们必然想到的是将数据进行切片,把数据根据某种规则放入多个不同的服务器节点,来降低单节点服务器的压力','0000-00-0000:00:00');

CREATETABLEIFNOTEXISTS`sph_counter`(

`counter_id`int(11)NOTNULL,

`max_doc_id`int(11)NOTNULL,

PRIMARYKEY(`counter_id`)

)ENGINE=MyISAMDEFAULTCHARSET=utf8COMMENT='增量索引标示的计数表';

以下采用"Main + Delta" ("主索引"+"增量索引")的索引策略,使用Sphinx自带的一元分词。

6、Sphinx配置:注意修改数据源配置信息

[php]view plaincopy

vi/usr/local/sphinx/etc/sphinx.conf

sourceitems{

type=mysql

sql_host=localhost

sql_user=root

sql_pass=123456

sql_db=sphinx_items

sql_query_pre=SETNAMESutf8

sql_query_pre=SETSESSIONquery_cache_type=OFF

sql_query_pre=REPLACEINTOsph_counterSELECT1,MAX(id)FROMitems

sql_query_range=SELECTMIN(id),MAX(id)FROMitems\

WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_range_step=1000

sql_ranged_throttle=1000

sql_query=SELECTid,title,content,created,0asdeletedFROMitems\

WHEREid<=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\

ANDid>=$startANDid<=$end

sql_attr_timestamp=created

sql_attr_bool=deleted

}

sourceitems_delta:items{

sql_query_pre=SETNAMESutf8

sql_query_range=SELECTMIN(id),MAX(id)FROMitems\

WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_query=SELECTid,title,content,created,0asdeletedFROMitems\

WHEREid>(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)\

ANDid>=$startANDid<=$end

sql_query_post_index=set@max_doc_id:=(SELECTmax_doc_idFROMsph_counterWHEREcounter_id=1)

sql_query_post_index=REPLACEINTOsph_counterSELECT2,IF($maxid,$maxid,@max_doc_id)

}

#主索引

indexitems{

source=items

path=/usr/local/sphinx/var/data/items

docinfo=extern

morphology=none

min_word_len=1

min_prefix_len=0

html_strip=1

html_remove_elements=style,script

ngram_len=1

ngram_chars=U+3000..U+2FA1F

charset_type=utf-8

charset_table=0..9,A..Z->a..z,_,a..z,U+410..U+42F->U+430..U+44F,U+430..U+44F

preopen=1

min_infix_len=1

}

#增量索引

indexitems_delta:items{

source=items_delta

path=/usr/local/sphinx/var/data/items-delta

}

#分布式索引

indexmaster{

type=distributed

local=items

local=items_delta

}

indexer{

mem_limit=256M

}

searchd{

listen=9312

listen=9306:mysql41#UsedforSphinxQL

log=/usr/local/sphinx/var/log/searchd.log

query_log=/usr/local/sphinx/var/log/query.log

compat_sphinxql_magics=0

attr_flush_period=600

mva_updates_pool=16M

read_timeout=5

max_children=0

dist_threads=2

pid_file=/usr/local/sphinx/var/log/searchd.pid

max_matches=1000

seamless_rotate=1

preopen_indexes=1

unlink_old=1

workers=threads#forRTtowork

binlog_path=/usr/local/sphinx/var/data

}保存退出

7、Sphinx创建索引

[php]view plaincopy

#第一次需重建索引:

[root@localhostbin]#./indexer-c/usr/local/sphinx/etc/sphinx.conf--all

Sphinx2.0.7-id64-release(r3759)

Copyright(c)2001-2012,AndrewAksyonoff

Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)

usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...

indexingindex'items'...

collected8docs,0.0MB

sorted0.0Mhits,100.0%done

total8docs,1121bytes

total1.017sec,1101bytes/sec,7.86docs/sec

indexingindex'items_delta'...

collected0docs,0.0MB

total0docs,0bytes

total1.007sec,0bytes/sec,0.00docs/sec

skippingnon-plainindex'master'...

total4reads,0.000sec,0.7kb/callavg,0.0msec/callavg

total14writes,0.001sec,0.5kb/callavg,0.1msec/callavg

#启动sphinx

[root@localhostbin]#./searchd-c/usr/local/sphinx/etc/sphinx.conf

Sphinx2.0.7-id64-release(r3759)

Copyright(c)2001-2012,AndrewAksyonoff

Copyright(c)2008-2012,SphinxTechnologiesInc(https://sphinxsearch.com)

usingconfigfile'/usr/local/sphinx/etc/sphinx.conf'...

listeningonallinterfaces,port=9312

listeningonallinterfaces,port=9306

precachingindex'items'

precachingindex'items_delta'

rotatingindex'items_delta':success

precached2indexesin0.012sec

#查看进程

[root@localhostbin]#ps-ef|grepsearchd

root304311023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf

root3043230431023:5900:00:00./searchd-c/usr/local/sphinx/etc/sphinx.conf

root304371490023:59pts/000:00:00grepsearchd

#停止Searchd:

./searchd-c/usr/local/sphinx/etc/sphinx.conf--stop

#查看Searchd状态:

./searchd-c/usr/local/sphinx/etc/sphinx.conf--status

索引更新及使用说明

"增量索引"每N分钟更新一次.通常在每天晚上低负载的时进行一次索引合并,同时重新建立"增量索引"。当然"主索引"数据不多的话,也可以直接重新建立"主索引"。

API搜索的时,同时使用"主索引"和"增量索引",这样可以获得准实时的搜索数据.本文的Sphinx配置将"主索引"和"增量索引"放到分布式索引master中,因此只需查询分布式索引"master"即可获得全部匹配数据(包括最新数据)。

索引的更新与合并的操作可以放到cron job完成:

[php]view plaincopy

crontab-e

*/1****/usr/local/sphinx/shell/delta_index_update.sh

03***/usr/local/sphinx/shell/merge_daily_index.sh

crontab-l

cron job所用的shell脚本例子:

delta_index_update.sh:

[php]view plaincopy

#!/bin/bash

/usr/local/sphinx/bin/indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>/dev/null2>&1merge_daily_index.sh:

[php]view plaincopy

#!/bin/bash

indexer=`whichindexer`

mysql=`whichmysql`

QUERY="usesphinx_items;selectmax_doc_idfromsph_counterwherecounter_id=2limit1;"

index_counter=$($mysql-h192.168.1.198-uroot-p123456-sN-e"$QUERY")

#merge"main+delta"indexes

$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotate--mergeitemsitems_delta--merge-dst-rangedeleted00>>/usr/local/sphinx/var/index_merge.log2>&1

if["$"-eq0];then

##updatesphinxcounter

if[!-z$index_counter];then

$mysql-h192.168.1.198-uroot-p123456-Dsphinx_items-e"REPLACEINTOsph_counterVALUES(1,'$index_counter')"

fi

##rebuilddeltaindextoavoidconfusionwithmainindex

$indexer-c/usr/local/sphinx/etc/sphinx.conf--rotateitems_delta>>/usr/local/sphinx/var/rebuild_deltaindex.log2>&1

fi

8、php中文分词scws安装:注意扩展的版本和php的版本

[php]view plaincopy

wget-chttps://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2

tarjxvfscws-1.2.3.tar.bz2

cdscws-1.2.3

./configure--prefix=/usr/local/scws

make&&makeinstall

9、scws的PHP扩展安装:

[php]view plaincopy

cd./phpext

phpize

./configure

make&&makeinstall

echo"[scws]">>/etc/php.ini

echo"extension=scws.so">>/etc/php.ini

echo"scws.default.charset=utf-8">>/etc/php.ini

echo"scws.default.fpath=/usr/local/scws/etc/">>/etc/php.ini

10、词库安装:[php]view plaincopy

wgethttps://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2

tarjxvfscws-dict-chs-utf8.tar.bz2-C/usr/local/scws/etc/

chownwww:www/usr/local/scws/etc/dict.utf8.xdb

11、php使用Sphinx+scws测试例子

在Sphinx源码API中,有好几种语言的API调用.其中有一个是sphinxapi.php。

不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分。

测试用的搜索类Search.php:注意修改getDBConnection()信息为自己的

[php]view plaincopy

classSearch{

/**

*@varSphinxClient

**/

protected$client;

/**

*@varstring

**/

protected$keywords;

/**

*@varresource

**/

privatestatic$dbconnection=null;

/**

*Constructor

**/

publicfunction__construct($options=array()){

$defaults=array(

'query_mode'=>SPH_MATCH_EXTENDED2,

'sort_mode'=>SPH_SORT_EXTENDED,

'ranking_mode'=>SPH_RANK_PROXIMITY_BM25,

'field_weights'=>array(),

'max_matches'=>1000,

'snippet_enabled'=>true,

'snippet_index'=>'items',

'snippet_fields'=>array(),

);

$this->options=array_merge($defaults,$options);

$this->client=newSphinxClient();

//$this->client->setServer("192.168.1.198",9312);

$this->client->setMatchMode($this->options['query_mode']);

if($this->options['field_weights']!==array()){

$this->client->setFieldWeights($this->options['field_weights']);

}

/*

if(in_array($this->options['query_mode'],[SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED])){

$this->client->setRankingMode($this->options['ranking_mode']);

}

*/

}

/**

*Query

*

*@paramstring$keywords

*@paraminteger$offset

*@paraminteger$limit

*@paramstring$index

*@returnarray

**/

publicfunctionquery($keywords,$offset=0,$limit=10,$index='*'){

$this->keywords=$keywords;

$max_matches=$limit>$this->options['max_matches']$limit:$this->options['max_matches'];

$this->client->setLimits($offset,$limit,$max_matches);

$query_results=$this->client->query($keywords,$index);

if($query_results===false){

$this->log('error:'.$this->client->getLastError());

}

$res=[];

if(empty($query_results['matches'])){

return$res;

}

$res['total']=$query_results['total'];

$res['total_found']=$query_results['total_found'];

$res['time']=$query_results['time'];

$doc_ids=array_keys($query_results['matches']);

unset($query_results);

$res['data']=$this->fetch_data($doc_ids);

if($this->options['snippet_enabled']){

$this->buildExcerptRows($res['data']);

}

return$res;

}

/**

*customsorting

*

*@paramstring$sortBy

*@paramint$mode

*@returnbool

**/

publicfunctionsetSortBy($sortBy='',$mode=0){

if($sortBy){

$mode=$mode:$this->options['sort_mode'];

$this->client->setSortMode($mode,$sortBy);

}else{

$this->client->setSortMode(SPH_SORT_RELEVANCE);

}

}

/**

*fetchdatabasedonmatcheddoc_ids

*

*@paramarray$doc_ids

*@returnarray

**/

protectedfunctionfetch_data($doc_ids){

$ids=implode(',',$doc_ids);

$queries=self::getDBConnection()->query("SELECT*FROMitemsWHEREidin($ids)",PDO::FETCH_ASSOC);

returniterator_to_array($queries);

}

/**

*buildexcerptsfordata

*

*@paramarray$rows

*@returnarray

**/

protectedfunctionbuildExcerptRows(&$rows){

$options=array(

'before_match'=>'',

'after_match'=>'',

'chunk_separator'=>'...',

'limit'=>256,

'around'=>3,

'exact_phrase'=>false,

'single_passage'=>true,

'limit_words'=>5,

);

$scount=count($this->options['snippet_fields']);

foreach($rowsas&$row){

foreach($rowas$fk=>$item){

if(!is_string($item)||($scount&&!in_array($fk,$this->options['snippet_fields'])))continue;

$item=preg_replace('/[\r\t\n]+/','',strip_tags($item));

$res=$this->client->buildExcerpts(array($item),$this->options['snippet_index'],$this->keywords,$options);

$row[$fk]=$res===false$item:$res[0];

}

}

return$rows;

}

/**

*databaseconnection

*

*@returnresource

**/

privatestaticfunctiongetDBConnection(){

$dsn='mysql:host=192.168.1.198;dbname=sphinx_items';

$user='root';

$pass='123456';

if(!self::$dbconnection){

try{

self::$dbconnection=newPDO($dsn,$user,$pass);

}catch(PDOException$e){

die('Connectionfailed:'.$e->getMessage());

}

}

returnself::$dbconnection;

}

/**

*Chinesewordssegmentation

*

**/

publicfunctionwordSplit($keywords){

$fpath=ini_get('scws.default.fpath');

$so=scws_new();

$so->set_charset('utf-8');

$so->add_dict($fpath.'/dict.utf8.xdb');

//$so->add_dict($fpath.'/custom_dict.txt',SCWS_XDICT_TXT);

$so->set_rule($fpath.'/rules.utf8.ini');

$so->set_ignore(true);

$so->set_multi(false);

$so->set_duality(false);

$so->send_text($keywords);

$words=[];

$results=$so->get_result();

foreach($resultsas$res){

$words[]='('.$res['word'].')';

}

$words[]='('.$keywords.')';

returnjoin('|',$words);

}

/**

*getcurrentsphinxclient

*

*@returnresource

**/

publicfunctiongetClient(){

return$this->client;

}

/**

*logerror

**/

publicfunctionlog($msg){

//logerrorshere

//echo$msg;

}

/**

*magicmethods

**/

publicfunction__call($method,$args){

$rc=newReflectionClass('SphinxClient');

if(!$rc->hasMethod($method)){

thrownewException('invalidmethod:'.$method);

}

returncall_user_func_array(array($this->client,$method),$args);

}

}

测试文件test.php:

[php]view plaincopy

require('Search.php');

$s=newSearch([

'snippet_fields'=>['title','content'],

'field_weights'=>['title'=>20,'content'=>10],

]);

$s->setSortMode(SPH_SORT_EXTENDED,'createddesc,@weightdesc');

//$s->setSortBy('createddesc,@weightdesc');

$words=$s->wordSplit("mysql集群");//先分词结果:(mysql)|(mysql集群)

//print_r($words);exit;

$res=$s->query($words,0,10,'master');

echo'

';print_r($res);

12、SphinxQL测试

要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置)。

[php]view plaincopy

[root@localhostbin]#mysql-h127.0.0.1-P9306-uroot-p

Enterpassword:

WelcometotheMySQLmonitor.Commandsendwith;or\g.

YourMySQLconnectionidis1

Serverversion:2.0.7-id64-release(r3759)

Copyright(c)2000,2016,Oracleand/oritsaffiliates.Allrightsreserved.

OracleisaregisteredtrademarkofOracleCorporationand/orits

affiliates.Othernamesmaybetrademarksoftheirrespective

owners.

Type'help;'or'\h'forhelp.Type'\c'toclearthecurrentinputstatement.

mysql>showglobalvariables;

+----------------------+---------+

|Variable_name|Value|

+----------------------+---------+

|autocommit|1|

|collation_connection|libc_ci|

|query_log_format|plain|

|log_level|info|

+----------------------+---------+

4rowsinset(0.00sec)

mysql>descitems;

+---------+-----------+

|Field|Type|

+---------+-----------+

|id|bigint|

|title|field|

|content|field|

|created|timestamp|

|deleted|bool|

+---------+-----------+

5rowsinset(0.00sec)

mysql>select*frommasterwherematch('mysql集群')limit10;

+------+---------+---------+

|id|created|deleted|

+------+---------+---------+

|1|2016|0|

|6|0|0|

+------+---------+---------+

2rowsinset(0.00sec)

mysql>showmeta;

+---------------+-------+

|Variable_name|Value|

+---------------+-------+

|total|2|

|total_found|2|

|time|0.006|

|keyword[0]|mysql|

|docs[0]|5|

|hits[0]|15|

|keyword[1]|集|

|docs[1]|3|

|hits[1]|4|

|keyword[2]|群|

|docs[2]|3|

|hits[2]|4|

+---------------+-------+

12rowsinset(0.00sec)

mysql>

本文由职坐标整理并发布,希望对同学们有所帮助。了解更多详情请关注职坐标编程语言PHP频道!

本文由 @小标 发布于职坐标。未经许可,禁止转载。
喜欢 | 1 不喜欢 | 0
看完这篇文章有何感觉?已经有1人表态,100%的人喜欢 快给朋友分享吧~
评论(0)
后参与评论

您输入的评论内容中包含违禁敏感词

我知道了

助您圆梦职场 匹配合适岗位
验证码手机号,获得海同独家IT培训资料
选择就业方向:
人工智能物联网
大数据开发/分析
人工智能Python
Java全栈开发
WEB前端+H5

请输入正确的手机号码

请输入正确的验证码

获取验证码

您今天的短信下发次数太多了,明天再试试吧!

提交

我们会在第一时间安排职业规划师联系您!

您也可以联系我们的职业规划师咨询:

小职老师的微信号:z_zhizuobiao
小职老师的微信号:z_zhizuobiao

版权所有 职坐标-一站式IT培训就业服务领导者 沪ICP备13042190号-4
上海海同信息科技有限公司 Copyright ©2015 www.zhizuobiao.com,All Rights Reserved.
 沪公网安备 31011502005948号    

©2015 www.zhizuobiao.com All Rights Reserved

208小时内训课程